The paper considers a variable length Markov chain model associated
with a group of stationary processes that share the same context
tree but potentially different conditional probabilities. We propose a 
new model selection and estimation method, develop oracle inequal- 
ities and model selection properties for the estimator. These results 
also provide conditions under which the use of the group structure 
can lead to improvements in the overall estimation. 

Our work is also motivated by two methodological applications: 
discrete stochastic dynamic programming and dynamic discrete choice 
models. We analyze the uniform estimation of the value function for 
dynamic programming and the uniform estimation of average dynamic
marginal effects for dynamic discrete choice models accounting
for possible imperfect model selection. 

We also derive the typical behavior of our estimator when applied
to polynomially $\beta$-mixing stochastic processes. For parametric
models, we derive uniform rates of convergence for the estimation error
of conditional probabilities and perfect model selection results.
For chains of infinite order with complete connections, we obtain ex- 
plicit uniform rates of convergence on the estimation of conditional 
probabilities, which have an explicit dependence on the processes'
continuity rates.

Finally, we investigate the empirical performance of the proposed
method in simulations and we apply this approach to account for possible
heterogeneity across different years in a linguistic application.
1. Introduction. The dependence structure of stationary processes is
of fundamental importance to understand the main features of the corresponding
dynamic models. However, in many applications, the dependence
structure is unknown. Not surprisingly, the development of models and 



'First Version: June 2010; Current Version: July 4, 2011. 

Supported by projects Universal and Produtividade em Pesquisa from CNPq, Brazil.
This work is part of USP project "Mathematics, computation, language and the brain". 

AMS 2000 subject classifications: Primary 62M05, 62M09, 62G05; secondary
62P20, 60J10

Keywords and phrases: categorical time series, group context tree, dynamic discrete 
choice models, dynamic programming, model selection, VLMC 




2 BELLONI AND OLIVEIRA 

model selection procedures for the estimation of the dependence structure 
have attracted substantial attention. 

Variable length Markov chains (VLMCs) have emerged as a prominent 
model class for stationary processes by allowing a flexible and parsimonious 
representation of the dependence structure. The estimation of the dependence
structure and the associated conditional probability distributions has
been made possible by exploiting the intrinsic hierarchical tree structure.
Importantly, despite the exponentially large dimensionality of the model
selection problem, efficient computational methods have been successfully
developed, starting with the seminal work of [23] proposing the Context
algorithm.

The tractability and applicability of this model attracted considerable 
interest in different fields like statistics, information theory and machine 
learning. Subsequently, several papers have proposed various estimators and 
developed their statistical properties. [8] studies consistency when the
underlying model increases in dimensionality as the sample size increases. They
also proposed and showed the validity of a bootstrap scheme based on fitted
VLMCs. [15] considers processes with infinite dependence for which there
exist "good" context tree approximations. They established new results on
a sieve methodology based on an adaptation of the Context algorithm. The BIC
Context Tree algorithm and its consistency properties have been considered
in [12], [14], [17] and [27]. Redundancy rates were studied by [13] and [17].
Typicality results were established by [11]. Several other works contributed 
to this literature in various directions [6], [7], [18] and others. 

In this work we consider a group of stationary processes over a discrete 
alphabet. These stationary processes share the same dependence structure 
but possibly different conditional probability distributions across groups. We 
refer to this model as group context tree alluding to the recent literature on 
group lasso [21, 28]. As in the case of group lasso, by combining different 
processes which possess the same dependence structure we hope to improve 
the overall estimation. Interestingly, several data sets obey this model in
many applications: in linguistics, texts from different issues are used [16]; in
biology, genetic sequences from different subjects [26]; and others.

We propose a new estimator for model selection and estimation of condi- 
tional probabilities for a group context tree model which is based on confi- 
dence intervals. Intuitively, the length of a valid confidence interval is closely 
related to the variance in the estimation of a given group context tree, and 
therefore can be used as a regularization penalty. The bias of a model is 
related to the continuity rates of the transition probabilities, that is to say, 
to their approximability by Markovian models of finite order [19]. Bias and 
variance are automatically balanced by a simple tree-pruning procedure that
is reminiscent of Rissanen's original Context algorithm. 

We study several statistical properties of these estimators. We develop 
oracle inequalities, model selection properties, and typicality results. The 
analysis derives finite-sample bounds for the estimation errors in various 
norms. We show that the group context tree model can lead to improvements 
on the estimation when compared to the single-process case. Moreover, two 
salient features of our results seem to be new even in the single-process case: 

(i) The estimated transition probabilities are uniformly close to their true 
values whenever the latter are continuous functions of the infinite past. 
This turns out to be important in the applications discussed below.
(ii) We obtain finite-sample bounds that account for the possible misspec- 
ification of the estimated model in a transparent way. 

In addition, two new methodological applications motivated our investi- 
gation of the group context tree model. The first is the discrete stochastic 
dynamic programming problem in operations research [22, 24, 25]. In this 
problem a decision maker chooses among actions that yield some instanta- 
neous reward and impact the system's transition to the next state. That is, 
the transition probability distributions between states depend on the his- 
tory of states and on the choice of action. Thus we have different processes 
for each action. Also, computational methods used in dynamic programming 
rely on having the same history of states across actions. The second method- 
ological application is the dynamic discrete choice model in economics [1, 4]. 
This model consists of many different agents making choices over time. In 
order to account for heterogeneous agents, it is of interest to allow agents to
have different conditional probability distributions at the same context, but
it is reasonable to expect that agents rely on the same context tree.

We also perform the analysis of the impact of possible misspecification and 
estimation errors from the group context tree in our two motivating appli- 
cations. In these applications, the objects of main interest are not the condi- 
tional probabilities but rather functionals of them. In the discrete stochastic 
dynamic programming problem the object of interest is the value function 
over the history of states, which is defined as a fixed point of a suitable op- 
erator that depends on the transition probabilities. We derive uniform error 
bounds for the $\ell_\infty$-norm between the fixed point computed from
estimates of the probabilities and the true value function. In dynamic
discrete choice models several statistics associated with the consumer behavior
of the agents are of interest. We focus our attention on the estimation
of the average marginal dynamic effects. We derive uniform bounds on the
rate of convergence for the estimates of all the average marginal dynamic
effects, accounting for the model selection and possible misspecification.

Lastly, we discuss the typical behavior of our estimator when applied 
to polynomially $\beta$-mixing stochastic processes. Based on these results, two
particular cases are worked out in detail. For parametric models, we derive 
uniform rate of convergence for the estimation error of conditional proba- 
bilities and perfect model selection results, essentially improving upon the 
assumptions of [8]. For chains of infinite order with complete connections, 
we obtain explicit uniform rates of convergence on the estimation of con- 
ditional probabilities, which have an explicit dependence on the processes' 
continuity rates. 

The paper is organized as follows. Section 2 formally introduces the approximate
group context tree model and the proposed estimator. Section 3
contains the statistical analysis of the estimator. Section 4 proposes
data-driven choices for the confidence radius. Section 5 is devoted to the
two motivating applications. New typicality results for $\beta$-mixing processes
are developed in Section 6. Simulations illustrating the performance of the
estimator are presented in Section 7. Section 8 applies the group context 
tree model to understand the difference between the rhythmic features in 
European and Brazilian Portuguese accounting for heterogeneity. Finally, 
proofs are deferred to the Appendices. 

2. Approximate group context tree. Let us introduce the notation
and definitions. Let $A$ denote a finite alphabet, $A_{-k}^{-1}$ denote all $A$-valued
sequences with length $k$, and $A^* = A_{-\infty}^{-1} \cup \bigcup_{k \ge 0} A_{-k}^{-1}$. The length of a string
$w$ is denoted by $|w|$ and, for each $1 \le k \le |w|$, $w_{-k}^{-1}$ is the suffix of $w$ with
length $k$. A subset $T \subset A^*$ is a tree if the empty string $e \in T$ and for all
$w = w_{-|w|} \dots w_{-1} \in T \setminus \{e\}$ its suffix $w_{-|w|+1}^{-1} = w_{-|w|+1} \dots w_{-1}$, called the
parent of $w$ and denoted by $\mathrm{par}(w)$, is also in $T$. An element of a tree $T$ that
is not the parent of any other element in $T$ is said to be a leaf of $T$. For
$w, w' \in A^*$, we write $w \preceq w'$ if $w$ is a suffix of $w'$. We associate with each
tree $T$ and each $x = \dots x_{-3} x_{-2} x_{-1} \in A_{-\infty}^{-1}$ a (possibly infinite) number

$$K_T(x) = \sup\{k \in \mathbb{N} : x_{-k}^{-1} \in T\}.$$

We will write $T(x) = x_{-K_T(x)}^{-1}$. The strings of the form $T(x)$ where $x$ ranges
over $A_{-\infty}^{-1}$ will be called the terminal nodes of $T$.

Remark 1 (Complete context trees). It is sometimes convenient to work
with complete trees, i.e. trees $T$ where all non-leaf nodes have exactly $|A|$
children. For a complete tree we have that if $x, y \in A_{-\infty}^{-1}$ and $T(x) \preceq y$, then $T(y) = T(x)$.

It is not hard to show that the same result will not hold for incomplete
trees; consider for instance $A = \{0,1\}$, $T = \{e, 0, 1, 00\}$, $x = \dots 0010$,
$y = \dots 0000$, $T(x) = 0$, and $T(y) = 00$.
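
The definitions of $T(x)$ and of completeness can be checked on the example of Remark 1 with a few lines of code. The sketch below is only an illustration under our own conventions: a tree is a Python set of strings containing the empty string, a finite string stands in for the infinite past $x$ and is written oldest symbol first, and the function names `terminal_node` and `is_complete` are hypothetical.

```python
def terminal_node(tree, past):
    """Return T(x): the longest suffix of `past` that belongs to `tree`.

    `past` is a finite stand-in for x, written oldest symbol first, so its
    last character plays the role of x_{-1}.
    """
    best = ""  # the empty string always belongs to a tree
    for k in range(1, len(past) + 1):
        if past[-k:] in tree:
            best = past[-k:]
    return best


def is_complete(tree, alphabet):
    """Check that every node has either no children or exactly |A| children.

    The children of w are the strings a + w (one more symbol into the past).
    """
    for w in tree:
        n_children = sum((a + w) in tree for a in alphabet)
        if n_children not in (0, len(alphabet)):
            return False
    return True


# Example of Remark 1: A = {0, 1}, T = {e, 0, 1, 00}
tree = {"", "0", "1", "00"}
t_x = terminal_node(tree, "0010")
t_y = terminal_node(tree, "0000")
complete = is_complete(tree, "01")
```

Here `t_x` is a suffix of the second past while `t_y` differs from `t_x`, which is possible only because the tree is not complete (node `0` has the child `00` but not `10`).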

In this paper a pair $(T, p)$ will always correspond to a tree $T$ and a mapping
$p$ that assigns to each terminal node $v$ of $T$ a probability distribution
$p(\cdot|v)$ over $A$. The set of probability distributions over $A$ will be denoted by
$\Delta^A$. A stationary ergodic process $X = (X_n)_{n \in \mathbb{Z}}$ will be said to be compatible
with $(T, p)$ if:

$$P(X_0 = a \mid X_{-\infty}^{-1}) = p(a \mid T(X_{-\infty}^{-1})) \quad \text{almost surely.}$$

If $X$ is compatible with $(T, p)$, we say that $T$ is a context tree for $X$. We
will also sometimes write $p(a \mid x)$ instead of $p(a \mid T(x))$. There always exists
a minimal complete context tree for X and we will always implicitly refer 
to that tree. 

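Finally, for two sequences $a_n, b_n$ we denote $a_n \lesssim b_n$ if $a_n = O(b_n)$. The
indicator function of an event $E$ is denoted by $\chi_E$ and for $q \ge 1$ the $\|\cdot\|_{L,q}$-norm
of a vector $v \in \mathbb{R}^L$ is defined as

$$\|v\|_{L,q} = \Big( \frac{1}{L} \sum_{\ell=1}^{L} |v_\ell|^q \Big)^{1/q}.$$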



2.1. Approximate group context tree model and oracle estimator. In an
exact group context tree model we have $L$ stationary processes, a context
tree $T^*$, and probability distributions $p_\ell$, $\ell = 1,\dots,L$, such that the $\ell$th
process is compatible with $(T^*, p_\ell)$ for $\ell = 1,\dots,L$. Note that $T^*$ is possibly
infinite so that this is not a restriction.

However, for a given sample, $T^*$ might be too long to be efficiently estimated.
Thus, in some cases, it is possible that a smaller tree that is slightly
misspecified leads to much more efficient estimates for the conditional probabilities
than $T^*$ would, because of the large variance of the latter. This motivates us to
consider an oracle context tree that balances bias and variance instead of
$T^*$ as our main goal for estimation.

In order to define an approximate context tree model consider a metric
$d_\ell : \Delta^A \times \Delta^A \to [0,1]$ for each process, $\ell = 1,\dots,L$. For notational convenience
we denote for $z, z' \in (A^*)^L$, $p(\cdot|z) = (p_1(\cdot|z(1)), \dots, p_L(\cdot|z(L)))$,
$d = (d_1, \dots, d_L)$, and

$$d(p(\cdot|z), p(\cdot|z')) = \big( d_1(p_1(\cdot|z(1)), p_1(\cdot|z'(1))), \dots, d_L(p_L(\cdot|z(L)), p_L(\cdot|z'(L))) \big).$$

By abuse of notation, we also apply the definitions above for $z \in A^*$ simply
meaning $z = z(1) = \dots = z(L)$.

For every $\ell = 1,\dots,L$, we observe a sample $X_1^n(\ell) = (X_1(\ell), \dots, X_n(\ell))$
from the $\ell$th process. Given $w \in A^*$ and $k \ge |w| + 1$, we let $N_{k,\ell}(w)$ denote the number of
indices $i$, $|w| \le i \le k$, with $X_{i-|w|+1}^{i}(\ell) = w$, that is, the number of occurrences
of $w$ in $X_1^k(\ell)$. For notational convenience we assume the length of the sample
for each group is the same but the analysis does not rely on that.

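Based on the sample, define the "oracle conditional probability" given a
context $w$, $a \in A$, $\ell = 1,\dots,L$ as

$$\bar p_{n,\ell}(a|w) := \frac{1}{N_{n-1,\ell}(w)} \sum_{i=|w|+1}^{n} \chi_{\{X_{i-|w|}^{i-1}(\ell) = w\}} \, p_\ell\big(a \,\big|\, X_{-\infty}^{i-1}(\ell)\big)$$

if $\min_{1 \le \ell \le L} N_{n-1,\ell}(w) > 0$, and we define $\bar p_{n,\ell}(a|w) = 1/|A|$ otherwise.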

The conditional probability distribution $\bar p_{n,\ell}(\cdot|w)$ will play the role of an
oracle estimate for the conditional probability $p_\ell(\cdot|w)$ which is adapted to
the given sample. Thus $\bar p_n$ is an intermediate step in the estimation of our
ultimate goal $p$. Indeed, under mild regularity conditions it follows that
$\bar p_{n,\ell}(a|X_{-k}^{-1}(\ell))$ converges to $p_\ell(a|X_{-\infty}^{-1}(\ell))$ as $k$ and $n$ grow at appropriate
rates. For each context $w \in A^*$, we denote the approximation error of using
$\bar p_n(\cdot|w)$ as an approximation for the underlying conditional probabilities as

$$(2.1) \qquad c_w := \sup\big\{ \|d(p(\cdot|z), \bar p_n(\cdot|w))\|_{L,k} : z(\ell) \succeq w,\ 1 \le \ell \le L \big\}.$$

Given a "confidence radius"

$$\mathrm{cf}(w) = (\mathrm{cf}_1(w), \dots, \mathrm{cf}_L(w)) \quad \text{for each } w \in A^*,$$

the context tree for the approximate model solves the following oracle problem
for some $k \ge 1$ and $r \ge 1$:

$$(2.2) \qquad \min_{T} \ \sup_{x \in A_{-\infty}^{-1}} \ \Big\{ c_{T(x)} + \|\mathrm{cf}(T(x))\|_{L,r} \Big\},$$

where the minimum is over all finite complete trees. The choice of $k$ and $r$
typically depends on the final purpose of using the model (see Section 5 for
examples).

The context tree $T$ that solves the oracle problem (2.2) balances the
bias of a misspecified model and the variance associated with its estimation
measured as a function of the confidence radius. We discuss a suitable data-driven
choice for $\mathrm{cf}$ in Section 4. For now, we just note that in general $\mathrm{cf}$
should depend on the metrics $d_\ell$, $\ell = 1,\dots,L$, and we assume that $\mathrm{cf}(w)$
is componentwise increasing in $w$, that is, $\mathrm{cf}_\ell(w) \le \mathrm{cf}_\ell(w')$ if $w' \succeq w$. For
convenience we also assume that $0 < \mathrm{cf}_\ell(w) \le 1$ for all $w \in A^*$.

Remark 2 (On the oracle problem, non-uniqueness). The oracle prob- 
lem (2.2) might have multiple solutions. Although the results derived here 
allow for any such solution to be considered, the oracle further selects a context
tree by fixing the paths which achieve the optimal value of the oracle's
objective function and further minimizing the criterion function over
the remaining paths.

Remark 3 (On the oracle problem, approximation error). Under mild
conditions on the processes the oracle can adjust the length of the contexts
in the oracle tree to make the approximation error $c_{T(x)}$ (at most) of
the same order as the regularization term, namely, there is a constant $K$
such that uniformly in $x \in A_{-\infty}^{-1}$ we have

$$(2.3) \qquad c_{T(x)} \le K \|\mathrm{cf}(T(x))\|_{L,r}.$$

2.2. Model selection and estimation of conditional probabilities. Next we 
discuss the model selection method which leads to the estimation of the con- 
ditional probabilities. The method relies on the length of confidence inter- 
vals associated with each suffix as a regularization term. These regularization 
terms are used to decide if a particular node should be pruned from the tree. 
After selecting a context tree, probabilities compatible with it are computed. 
Next we describe in detail the procedure. 

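For each $w \in A^*$ with $\min_{1 \le \ell \le L} N_{n-1,\ell}(w) > 0$, we define:

$$\hat p_{n,\ell}(a|w) = \frac{N_{n,\ell}(wa)}{N_{n-1,\ell}(w)}, \quad \text{for } a \in A,\ \ell = 1,\dots,L,$$

as an empirical estimate for the conditional probability distribution $p_\ell(a|w)$.
For definiteness we set $\hat p_{n,\ell}(a|w) = 1/|A|$ if $\min_{1 \le \ell \le L} N_{n-1,\ell}(w) = 0$.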
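
The counting behind the empirical estimate can be sketched in a few lines. This is a minimal illustration, assuming that each sample is a Python string written oldest symbol first, that $N_{k,\ell}(w)$ counts the occurrences of $w$ among the first $k$ symbols, and that the estimate is the ratio of the count of $wa$ in the whole sample to the count of $w$ in the first $n-1$ symbols; the function names are hypothetical.

```python
def count_occurrences(sample, w, k):
    """N_k(w): number of occurrences of the string w in the first k symbols."""
    m = len(w)
    # i is the (1-indexed) position of the last symbol of a candidate match
    return sum(1 for i in range(m, k + 1) if sample[i - m:i] == w)


def empirical_probs(samples, w, alphabet):
    """Empirical conditional distributions given context w, one per process.

    Falls back to the uniform distribution for every process when w is not
    observed (among the first n - 1 symbols) in at least one of the samples.
    """
    counts = [count_occurrences(s, w, len(s) - 1) for s in samples]
    if min(counts) == 0:
        return [{a: 1.0 / len(alphabet) for a in alphabet} for _ in samples]
    return [
        {a: count_occurrences(s, w + a, len(s)) / c for a in alphabet}
        for s, c in zip(samples, counts)
    ]


# Two short binary samples sharing the context "0"
samples = ["0101101", "0010010"]
probs = empirical_probs(samples, "0", "01")
```

Because every counted occurrence of `w` is followed by exactly one symbol, each returned distribution sums to one.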

Let $E_n$ be the suffix tree that contains every string $w \in A^*$ such that
$\min_{1 \le \ell \le L} N_{n-1,\ell}(w) > 0$. For a fixed constant $c > 1$ define:

$$(2.4) \qquad \mathrm{CanRmv}(w) = \begin{cases} 1, & \text{if for all } w', w'' \in E_n \text{ with } w \preceq w',\ \mathrm{par}(w) \preceq w'': \\ & \|d(\hat p_n(\cdot|w'), \hat p_n(\cdot|w''))\|_{L,k} \le c\|\mathrm{cf}(w')\|_{L,r} + c\|\mathrm{cf}(w'')\|_{L,r}; \\ 0, & \text{otherwise.} \end{cases}$$

The constant $c$ is similar in spirit to the slack parameter used in [3];
we recommend $c = 1.01$ in practice. Intuitively, $\mathrm{CanRmv}(w) = 1$ only if
removing $w$ from the tree will not make a significant impact on the estimation
of conditional probabilities relative to the noise in their estimation. We prune
$E_n$ into a tree $\hat T_n$ via the pruning algorithm in Figure 1 to obtain our estimate
for the oracle context tree $T$ defined as the solution of (2.2).
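
As an illustration of the removal test, the sketch below evaluates the criterion by brute force over all pairs of extensions. It rests on several assumptions of ours: strings are written oldest symbol first (so the parent of `w` is `w[1:]`), the metric for each process is the sup-distance between distributions, the group norm is the averaged $q$-norm, and the defaults `k=1`, `r=inf` are arbitrary choices for the example. The names `group_norm`, `sup_metric` and `can_rmv` are hypothetical.

```python
import math


def group_norm(v, q):
    """Averaged q-norm of a vector over the L processes (max when q is inf)."""
    if math.isinf(q):
        return max(abs(t) for t in v)
    return (sum(abs(t) ** q for t in v) / len(v)) ** (1.0 / q)


def sup_metric(p, q):
    """Sup-distance between two distributions given as dicts over the alphabet."""
    return max(abs(p[a] - q[a]) for a in p)


def can_rmv(w, E_n, p_hat, cf, c=1.01, k=1, r=math.inf):
    """Brute-force removal test for a non-empty string w.

    E_n   : set of observed strings
    p_hat : dict mapping each string to a list of L distributions
    cf    : dict mapping each string to a list of L confidence radii
    """
    parent = w[1:]  # drop the oldest symbol
    ext_w = [u for u in E_n if u.endswith(w)]         # strings having w as suffix
    ext_par = [v for v in E_n if v.endswith(parent)]  # strings having par(w) as suffix
    for u in ext_w:
        for v in ext_par:
            dist = [sup_metric(pu, pv) for pu, pv in zip(p_hat[u], p_hat[v])]
            bound = c * group_norm(cf[u], r) + c * group_norm(cf[v], r)
            if group_norm(dist, k) > bound:
                return False
    return True
```

The double loop is quadratic in the number of extensions; Remark 6 describes how the same information can be maintained incrementally.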

Procedure PruneTree

    $\hat T_n \leftarrow E_n$    (in the beginning, $\hat T_n$ contains all visible strings)
    for each node $w$ of $E_n$ do
        $\mathrm{exam}(w) \leftarrow 0$    (all nodes start out unexamined)
    end for
    while there is a leaf $w \in \hat T_n$ with $\mathrm{exam}(w) = 0$ do    (while there are unexamined leaves)
        if $\mathrm{CanRmv}(w) = 1$ then    (if $w$ can be removed, remove it)
            $\hat T_n \leftarrow \hat T_n \setminus \{w\}$
        end if
        $\mathrm{exam}(w) \leftarrow 1$    ($w$ has been examined)
    end while
    return $\hat T_n$

Fig 1. The pruning algorithm.
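
The procedure of Figure 1 translates almost line by line into code. The following sketch is an illustration only: it takes the removal test as a user-supplied callable `can_rmv`, represents strings oldest symbol first (so the parent of `u` is `u[1:]`), and, as an assumption of ours, never examines the empty string so that the result always remains a tree.

```python
def prune_tree(E_n, can_rmv):
    """Prune the set of strings E_n following the procedure of Figure 1.

    E_n     : set of strings, closed under taking parents, containing ""
    can_rmv : callable taking a string w and returning True if w may be removed
    """
    tree = set(E_n)
    examined = set()

    def is_leaf(w):
        # w is a leaf if no string currently in the tree has w as its parent
        return not any(u[1:] == w for u in tree if u)

    while True:
        # unexamined leaves of the current tree (the root is never examined)
        pending = [w for w in tree if w and w not in examined and is_leaf(w)]
        if not pending:
            return tree
        w = pending[0]
        if can_rmv(w):
            tree.discard(w)  # removing w may turn its parent into a leaf
        examined.add(w)


# Toy run: the strings listed in `removable` pass the removal test
E_n = {"", "0", "1", "00", "10", "11"}
removable = {"00", "10", "0"}
pruned = prune_tree(E_n, lambda w: w in removable)
```

In this toy run the two children of `0` are removed first, which makes `0` a leaf that is then removed as well, while `1` is never examined because its child `11` stays in the tree.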

After selecting the model $\hat T_n$ we proceed to estimate the conditional probability
distributions. Given $x \in A_{-\infty}^{-1}$ we will assign a conditional probability
$\hat P_n(\cdot|x)$ which is compatible with $\hat T_n$, as follows:

$$\hat P_n(\cdot|x) = \hat p_n(\cdot \mid \hat T_n(x)) = \hat p_n\big(\cdot \mid x_{-K_{\hat T_n}(x)}^{-1}\big).$$

Remark 4 (Complete context trees, continued). In general, $\hat T_n$ will not
be complete in the sense of Remark 1. Completeness can always be achieved
by adding leaf nodes to the non-leaf nodes of the tree after the algorithm
has finished. In that case, the conditional probabilities for the added leaves
are set to the conditional probabilities of their corresponding parent, which
was not pruned by the algorithm.

Remark 5 (Variations of CanRmv). All results we will establish in the
following sections would remain true if in the definition of $\mathrm{CanRmv}(w)$ in
(2.4) we set $w'' \in W$ where $\mathrm{par}(w) \in W \subseteq \{z \in E_n : z \succeq \mathrm{par}(w)\}$. For the
same choice of confidence radius, computationally we would like to use the
smallest set $W$ while statistically we would like to use the largest such set.

Remark 6 (Computational efficiency). The algorithm can be implemented
efficiently, i.e. in polynomial time with respect to the parameters
$L$ and $n$ of the data. Observe that $\mathrm{CanRmv}(w)$ can be computed efficiently
from the list of values:

$$\mathrm{List}(w) = \{(\hat p_n(\cdot|w'), \mathrm{cf}(w')) : w' \in E_n,\ w' \succeq w\}$$

and the corresponding list for $\mathrm{par}(w)$. Since $\mathrm{CanRmv}(w)$ is only computed
for leaves of the current tree $\hat T_n$, we only need to ensure that at all times,
each leaf node and each parent of a leaf stores the correct list $\mathrm{List}(w)$. This
can be achieved as follows.

• initially, one sets $\mathrm{List}(w) = \{(\hat p_n(\cdot|w), \mathrm{cf}(w))\}$ for each $w \in E_n$;

• whenever a leaf $w$ is examined in $\hat T_n$, its parent's list is updated:

$$\mathrm{List}(\mathrm{par}(w)) \leftarrow \mathrm{List}(\mathrm{par}(w)) \cup \bigcup_{w' \in \hat T_n :\ \mathrm{par}(w') = \mathrm{par}(w)} \mathrm{List}(w').$$

Actually, this update only needs to be performed the first time a
child of $\mathrm{par}(w)$ is examined.

We note in passing that more efficient algorithms can be found for the case
$L = 1$ with the $\ell_\infty$ metric by using compact suffix trees. This will be elaborated
upon in a companion paper.

3. Analysis. In this section we derive our main theoretical results on 
the performance of the estimates proposed in Section 2. We start by charac- 
terizing an event which ensures that the pruning and estimation procedures 
behave properly. Then we proceed to establish model selection results and 
oracle inequalities for the estimation of the conditional probabilities when 
using the algorithm described in the previous section. 

3.1. Regularization event. The sources of the estimation error are the
deviation of $\hat p_{n,\ell}(\cdot|w)$ from the oracle conditional probabilities $\bar p_{n,\ell}(\cdot|w)$ and
the approximation error of using the latter instead of $p_\ell$. Under this time
series framework, the key issues are the need to simultaneously estimate 
these probabilities for a number of suffixes w £ A* that is substantially 
larger than the sample size n, and to select one model (i.e. a context tree) 
among the exponentially many possible models. 
The underlying idea will be to rely on some regularization which would
prune from the estimated tree suffixes that contain a large noise relative
to the explanatory power they could bring to the model. Hopefully, this
would reduce the number of suffixes in the estimated context tree, leading to
consistent estimates of the conditional probabilities. Our main insight is to
rely on the length of a confidence interval for $\bar p_{n,\ell}(\cdot|w)$ as the main ingredient
for the regularization penalty.

The regularization penalty in the pruning algorithm consists of a confidence
radius $\mathrm{cf}_\ell(w)$ for each suffix $w \in A^*$ and process $\ell = 1,\dots,L$, so that
with high probability the following event occurs for a prescribed $m \ge 1$:

$$\mathrm{Good}_m := \bigcap_{w \in A^*} \left\{ \left\| \left( \frac{d_\ell(\hat p_{n,\ell}(\cdot|w), \bar p_{n,\ell}(\cdot|w))}{\mathrm{cf}_\ell(w)} \right)_{\ell=1}^{L} \right\|_{L,m} \le 1 \ \text{ if } \min_{1 \le \ell \le L} N_{n-1,\ell}(w) > 0 \right\}.$$

By the definition of $\|\cdot\|_{L,m}$, if $d_\ell(\hat p_{n,\ell}(\cdot|w), \bar p_{n,\ell}(\cdot|w)) \le \mathrm{cf}_\ell(w)$ for every
$\ell = 1,\dots,L$, the event $\mathrm{Good}_m$ occurs, but it is also likely to occur if this
condition holds for most values of $\ell$. For efficiency reasons we will aim to
choose the smallest confidence radius such that

$$P(\mathrm{Good}_m) \ge 1 - \delta$$

where $1 - \delta$ is our desired confidence level. In Section 4 we will propose and
analyze data-driven choices for the confidence radius so that $\mathrm{Good}_m$ occurs
with high probability.

The following condition summarizes our setup.

Condition 1 (AGCT). We have data $\{X_1^n(\ell) \in A^n, \ell = 1,\dots,L\}$
that for each $n$ obey the model in Section 2, with $T$ defined by (2.2) where
$\mathrm{cf}_\ell(w) \le 1$ is componentwise increasing in $w \in A^*$, for every $1 \le \ell \le L$.
The approximation error $c_{T(x)}$ is defined as in (2.1). The positive (extended)
integers $k$, $r$ and $m$ satisfy $k < m$ and $r \ge km/(m-k)$ (equivalently $k < r$,
$m \ge kr/(r-k)$), or $k \le r = m = \infty$.

The event $\mathrm{Good}_m$ will imply many desirable properties of the estimators,
including oracle inequalities. The following result relates the criterion
function with the regularization term used in the oracle under the event
$\mathrm{Good}_m$.

Theorem 1. Suppose Condition AGCT holds and that the event $\mathrm{Good}_m$
occurs. Then, for all $w \in A^*$, we have

$$\|d(\hat p_n(\cdot|w), \bar p_n(\cdot|w))\|_{L,k} \le \|\mathrm{cf}(w)\|_{L,r}.$$

Thus, if $\mathrm{Good}_m$ occurs, up to the regularization, $\hat p_n(\cdot|w)$ can be used as a
good approximation for the oracle conditional probabilities $\bar p_n(\cdot|w)$ uniformly
over $w \in A^*$. An immediate corollary for the estimation of the true conditional
probabilities follows by the triangle inequality.
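
A short sketch of why a bound of the form in Theorem 1 follows from the event $\mathrm{Good}_m$ may be helpful. It is not a substitute for the proof in the Appendices: it assumes that $\|\cdot\|_{L,q}$ is the averaged $q$-norm over the $L$ processes, it treats only the case $k < m < \infty$ with $\min_{\ell} N_{n-1,\ell}(w) > 0$, and the auxiliary symbols $a_\ell$, $b_\ell$ and $r'$ are introduced here only for illustration.

```latex
\begin{align*}
a_\ell &:= \frac{d_\ell(\hat p_{n,\ell}(\cdot|w), \bar p_{n,\ell}(\cdot|w))}{\mathrm{cf}_\ell(w)},
\qquad b_\ell := \mathrm{cf}_\ell(w),
\qquad r' := \frac{km}{m-k}
\quad \Big(\text{so that } \tfrac{1}{k} = \tfrac{1}{m} + \tfrac{1}{r'}\Big), \\[4pt]
\|d(\hat p_n(\cdot|w), \bar p_n(\cdot|w))\|_{L,k}
&= \Big( \frac{1}{L} \sum_{\ell=1}^{L} a_\ell^{k}\, b_\ell^{k} \Big)^{1/k} \\
&\le \Big( \frac{1}{L} \sum_{\ell=1}^{L} a_\ell^{m} \Big)^{1/m}
     \Big( \frac{1}{L} \sum_{\ell=1}^{L} b_\ell^{r'} \Big)^{1/r'}
   && \text{(H\"older with exponents } m/k \text{ and } r'/k\text{)} \\
&\le 1 \cdot \|\mathrm{cf}(w)\|_{L,r'}
   && \text{(event } \mathrm{Good}_m\text{)} \\
&\le \|\mathrm{cf}(w)\|_{L,r}
   && \text{(} r \ge r' \text{; averaged norms are nondecreasing in the exponent).}
\end{align*}
```

When some $N_{n-1,\ell}(w)$ is zero, both $\hat p_n(\cdot|w)$ and $\bar p_n(\cdot|w)$ are set to the uniform distribution, so the left-hand side vanishes and the bound holds trivially.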

Corollary 1. Suppose Condition AGCT holds and that the event $\mathrm{Good}_m$
occurs. Then we have for all $x \in A_{-\infty}^{-1}$

$$\|d(\hat p_n(\cdot|T(x)), p(\cdot|x))\|_{L,k} \le c_{T(x)} + \|\mathrm{cf}(T(x))\|_{L,r}.$$

In words, this corollary establishes that $\hat p_n(\cdot|T(x))$ is also a good approximation
to $p(\cdot|x)$ itself if the event $\mathrm{Good}_m$ occurs.

Up to this point, the parameter $c > 1$ used in the algorithm proposed in
Section 2 has not played a role. The role of $c$ will be to allow us to extend
the nice properties established above for the oracle context tree $T$ to the
estimated context tree $\hat T_n$, accounting for the model selection, which can lead
to a misspecified estimated model.

3.2. Model selection results. Next we derive properties of the estimated
context tree $\hat T_n$ relative to the oracle context tree $T$ and the (possibly infinite)
compatible context tree $T^*$ of the processes. Recall that, as mentioned
in Section 2, we have $T \subseteq T^*$. The next result addresses a similar question
for the estimated context tree $\hat T_n$.

Theorem 2. Suppose Condition AGCT holds, and that the event $\mathrm{Good}_m$
occurs. Then we have $\hat T_n \subseteq T^*$.

In particular, Theorem 2 shows that by a suitable choice of the confidence
radius $\mathrm{cf}$ we have that the estimated context tree does not overestimate a
compatible context tree $T^*$ with high probability.

As discussed before, a compatible context tree might be too long and 
might not lead to the most efficient estimation of the conditional proba- 
bilities. This is the underlying motivation for the oracle context tree T. It 
balances bias and variance to achieve a good estimation performance. Thus 
a more interesting question is how the estimated context tree $\hat T_n$ compares
with the oracle context tree $T$. Although some branches of $\hat T_n$ can be
longer than the corresponding branches of $T$, we can show that they cannot
be too much longer when measured in the regularization function.

Theorem 3. Suppose Condition AGCT holds, $T$ is a complete tree, and
the event $\mathrm{Good}_m$ occurs. We have that for all $x \in A_{-\infty}^{-1}$

$$\|\mathrm{cf}(\hat T_n(x))\|_{L,r} \le \max\Big\{ \|\mathrm{cf}(T(x))\|_{L,r},\ \frac{2\, c_{T(x)}}{c-1} - \|\mathrm{cf}(T(x))\|_{L,r} \Big\}.$$

This result establishes an oracle inequality for the regularization term of
the leaves of $\hat T_n$. The slack parameter $c > 1$ combined with the $\mathrm{Good}_m$ event
allows us to restrict the $\|\cdot\|_{L,r}$-size of $\hat T_n$. In particular, in the typical case in
which the oracle context tree yields $c_{T(x)} \lesssim \|\mathrm{cf}(T(x))\|_{L,r}$, the inequality
above achieves $\|\mathrm{cf}(\hat T_n(x))\|_{L,r} \lesssim \|\mathrm{cf}(T(x))\|_{L,r}$.

It is possible to derive bounds on the actual length of $\hat T_n(x)$ relative to
$T(x)$ by relying on the following regularity condition.

Condition 2 (RL). There is a $\kappa \in (0,1)$ and an integer $\bar k \ge 1$ such that

k\og\\cf(T(x))\\-r 



Condition RL is similar to a modulus of continuity between the regularization
penalty and the length in a neighborhood of $T(x)$. It turns out that
such a condition is satisfied for many designs of interest. Under Condition RL,
we can establish the following result.

Theorem 4. Suppose that Conditions AGCT and RL hold, and that the
event $\mathrm{Good}_m$ occurs. Then for all $x \in A_{-\infty}^{-1}$ we have that

$$|\hat T_n(x)| \le |T(x)| + \frac{\bar k}{\log(1/\kappa)} \max\Big\{ 0,\ \log\Big( \frac{2}{c-1} \cdot \frac{c_{T(x)}}{\|\mathrm{cf}(T(x))\|_{L,r}} \Big) \Big\}.$$

It is interesting to consider the result of Theorem 4 when (2.3) holds. In
this case we have

$$|\hat T_n(x)| \le |T(x)| + \frac{\bar k}{\log(1/\kappa)} \max\big\{ 0,\ \log K + \log(2/(c-1)) \big\}.$$

Moreover, under mild conditions on the process, $\kappa$ is bounded away from
zero and $\bar k$ is bounded above uniformly in $n$. Therefore, these regularity
conditions imply that the length of $\hat T_n(x)$ is not larger than the length of
$T(x)$ plus a constant.

3.3. Oracle inequality for conditional probability estimation. Next we focus
on the estimation of the conditional probabilities. In particular our goal
is to derive uniform bounds over sequences $x \in A_{-\infty}^{-1}$ between the true
conditional probability distributions $p(\cdot|x)$ and their estimate
$\hat P_n(\cdot|x) = \hat p_n(\cdot|\hat T_n(x))$. Our main result is the following oracle inequality.

Theorem 5. Suppose Condition AGCT holds and that the event $\mathrm{Good}_m$
occurs. Then for all $x \in A_{-\infty}^{-1}$ we have

$$\|d(\hat P_n(\cdot|x), \bar p_n(\cdot|T(x)))\|_{L,k} \le \max\Big\{ (1+2c)\|\mathrm{cf}(T(x))\|_{L,r},\ \frac{c+1}{c-1}\, c_{T(x)} \Big\}$$

and

$$\|d(\hat P_n(\cdot|x), p(\cdot|x))\|_{L,k} \le c_{T(x)} + \max\Big\{ (1+2c)\|\mathrm{cf}(T(x))\|_{L,r},\ \frac{c+1}{c-1}\, c_{T(x)} \Big\}.$$

The combination of the event $\mathrm{Good}_m$ and the parameter $c > 1$ in the
algorithm allows us to derive finite-sample guarantees on the estimation performance
of the conditional probabilities. Furthermore, in the typical case
that $c_{T(x)} \lesssim \|\mathrm{cf}(T(x))\|_{L,r}$, under the conditions of Theorem 5 we have

$$\|d(\hat P_n(\cdot|x), p(\cdot|x))\|_{L,k} \lesssim \|\mathrm{cf}(T(x))\|_{L,r},$$

recovering the same performance (up to a constant factor) of the oracle
estimator.
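
To see how the slack parameter enters the constant hidden in this comparison, the following worked step combines the second bound of Theorem 5, as we read it, with the condition (2.3) that $c_{T(x)} \le K \|\mathrm{cf}(T(x))\|_{L,r}$. It is only an illustration of the arithmetic, not an additional result.

```latex
\begin{align*}
\|d(\hat P_n(\cdot|x), p(\cdot|x))\|_{L,k}
&\le c_{T(x)} + \max\Big\{ (1+2c)\,\|\mathrm{cf}(T(x))\|_{L,r},\ \frac{c+1}{c-1}\, c_{T(x)} \Big\} \\
&\le \Big( K + \max\Big\{ 1+2c,\ \frac{c+1}{c-1}\, K \Big\} \Big)\, \|\mathrm{cf}(T(x))\|_{L,r}, \\[4pt]
\text{for } c = 1.01: \qquad
1 + 2c &= 3.02, \qquad \frac{c+1}{c-1} = \frac{2.01}{0.01} = 201 .
\end{align*}
```

With $c = 1.01$ the constant is therefore $K + \max\{3.02,\ 201K\}$: a slack parameter close to 1 keeps the factor in front of the regularization term small but inflates the factor in front of the approximation error.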



Remark 7. The results in Theorem 5 hold for any tree $T$ and not only
the oracle tree $T$. In fact, plugging in a compatible context tree $T^*$ yields

$$\|d(\hat P_n(\cdot|x), p(\cdot|x))\|_{L,k} \le (1+2c)\|\mathrm{cf}(T^*(x))\|_{L,r}.$$



4. Data-driven choices for confidence radius. In this section we
propose and analyze data-driven choices for setting the confidence radius
$\mathrm{cf}(w)$ for all $w \in A^*$. Intuitively, the confidence radius should majorize
the deviations of $\hat p_{n,\ell}$ from $\bar p_{n,\ell}$, which is the noise in our model selection
problem. A large confidence radius overcomes the noise easily but introduces a
large bias. A small confidence radius would not allow for consistent estimation
due to the large number of potential models.

By definition of the event $\mathrm{Good}_m$, a proper choice of confidence radius $\mathrm{cf}$
will depend on the metrics $d_\ell$ and on the choice of $m$. We will consider in
detail the (extreme) cases that are the relevant ones in the methodological
applications in Section 5: $m = 2$ and $m = \infty$. Regarding the choice of
metrics, here we will consider $d_\ell = \|\cdot\|_\infty$ for every $\ell = 1,\dots,L$, and
$d_\ell = \|\cdot\|_1/2$ for every $\ell = 1,\dots,L$. A unifying approach to work with these
metrics consists of choosing an appropriate family of sets $\mathcal{S} \subseteq 2^A$ to write
them as

$$d_\ell(p_\ell, p'_\ell) = \sup_{S \in \mathcal{S}} |p_\ell(S) - p'_\ell(S)|$$

where $\|\cdot\|_\infty$ corresponds to $\mathcal{S} = \{\{a\} : a \in A\}$ and $\|\cdot\|_1/2$ corresponds to
$\mathcal{S} = 2^A$.
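
The two metrics can be compared numerically through this common representation. The sketch below is a small illustration with made-up distributions over a four-letter alphabet; the helper names `sup_over_sets`, `singletons` and `all_subsets` are ours. It evaluates the supremum over singletons (the sup-norm distance) and over all subsets (half of the $\ell_1$ distance).

```python
from itertools import chain, combinations


def sup_over_sets(p, q, sets):
    """sup over S in `sets` of |p(S) - q(S)| for distributions given as dicts."""
    return max(abs(sum(p[a] for a in S) - sum(q[a] for a in S)) for S in sets)


def singletons(alphabet):
    """The family {{a} : a in A}."""
    return [(a,) for a in alphabet]


def all_subsets(alphabet):
    """The family of all subsets of A (including the empty set)."""
    items = list(alphabet)
    return list(
        chain.from_iterable(combinations(items, k) for k in range(len(items) + 1))
    )


# Hypothetical distributions, chosen so that the two metrics differ
p = {"a": 0.4, "b": 0.4, "c": 0.1, "d": 0.1}
q = {"a": 0.1, "b": 0.1, "c": 0.4, "d": 0.4}

d_sup = sup_over_sets(p, q, singletons("abcd"))    # sup-norm distance
d_tv = sup_over_sets(p, q, all_subsets("abcd"))    # supremum over all subsets
half_l1 = 0.5 * sum(abs(p[a] - q[a]) for a in p)   # half of the l1 distance
```

For these inputs `d_tv` agrees with `half_l1` up to rounding and is twice as large as `d_sup`, which is one way to see that the choice of the family of sets affects how large the confidence radius has to be.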

Each set $S \in \mathcal{S}$ induces a martingale, so deviations between $\hat p_{n,\ell}(S|w)$
and $\bar p_{n,\ell}(S|w)$ can be controlled by martingale inequalities developed in Appendix E.
Several other combinations can be derived directly from the presented
analysis. Nonetheless, the cases covered here will allow us to compare
interesting features of group context tree models relative to the traditional 
(single-process) context tree model. 

In order to clearly communicate the main results we simplify the constants 
in the data-driven choices in the main text. In the appendix we state precise 
results that share the same asymptotic rates. For notational convenience 
define a function, for $w \in A^*$, $\varepsilon \in (0,1)$, $\ell = 1,\dots,L$:

$$c_\ell(w, \varepsilon) := 2 \sqrt{\frac{1}{N_{n-1,\ell}(w)}} \sqrt{\log(1/\varepsilon) + 2\log\big(2 + 2\log N_{n-1,\ell}(w)\big)}.$$

The value of $\varepsilon$ is set to account for the model selection and the confidence
level $1 - \delta$ of the event $\mathrm{Good}_m$ occurring. The $\log\log N_{n-1,\ell}(w)$ factor
emerges from the time dependence in the data. In the traditional single-process
case, $L = 1$, $\varepsilon$ is chosen so that $\log(1/\varepsilon) = O(\log(n/\delta))$. This is
similar to the rate in the case of $\mathrm{Good}_\infty$ for the group context tree.

Theorem 6. Let $\delta \in (0,1)$, and for $w \in A^*$ with $\min_{1 \le \ell \le L} N_{n-1,\ell}(w) > 0$,
$\ell = 1,\dots,L$, let

Then, setting the confidence radius $\mathrm{cf}_\ell(w)$ to be the minimum of this quantity and 1, with
probability at least $1 - \delta$ the event $\mathrm{Good}_\infty$ occurs.

Thus, the rate for the confidence radius is $\sqrt{\log(nL/\delta)/N_{n-1,\ell}(w)}$, recovering the rate of a
single group in the typical case that $\log L \lesssim \log n$. However, for other choices
of $m$, it is possible to improve upon that rate by exploiting the context
tree estimation across different groups. The following result addresses this
question for $m = 2$.

Theorem 7. Suppose Condition AGCT holds. Let $\delta \in (0,1)$, and assume
that for some $a < 3$ we have $\log(n^2L/\delta) \le aL\log\log n$, and for
$w \in A^*$, $\ell = 1,\dots,L$, let



V 4|5|log (n^L/(5)/ U V logn J
Then, setting the confidence radius to be $\mathrm{cf}_\ell(w) = \min\{\mathrm{cf}^{2}_\ell(w), 1\}$, the event
$\mathrm{Good}_2$ occurs with probability at least $1-\delta$.

In this case, if log^ (n/δ) = o(L), the rate of $\mathrm{cf}^{2}_\ell(w)$ is $\sqrt{\log\log n/N_{n-1,\ell}(w)}$,
improving upon the single-process case. This is remarkably close to the error
in the estimation of probabilities if the oracle model were known in advance.

4.1. Improvement based on maximal variance. The choices above do not
exploit the intrinsic variance within the norm, namely

$$\sigma^2_\ell(w) := \max_{S \in \mathcal{S}} p_{n,\ell}(S|w)\bigl(1 - p_{n,\ell}(S|w)\bigr).$$

Although this factor does not affect the rate in general, in finite samples it
can yield an important factor to avoid over-regularization. This is particularly
evident in the case of $d_{\mathcal{S}} = \|\cdot\|_\infty$ with $|A| > 2$.

Our second data-driven choice will account for that to generate a strictly
smaller confidence radius with the same guarantee. However, the
variance-based control can be applied to a suffix $w$ only if there were enough
occurrences of the suffix in the data, namely if the following event occurred

, j .J f , ^ 2login^\S\/6)+Alog[2 + 2log{aj{w)Nn-i4w))] \ 

y af{w)\og\^/2) J 

When this is not the case, we can still use the previous choice $\mathrm{cf}^{m}_\ell(w)$. To concisely
state the results regarding the maximum variance we define

ai{w) := V2at{w)x{j,^^} + X{JiJ < 1- 

Theorem 8. Suppose Condition AGCT holds, choose $\delta \in (0,1)$, and let
$m = 2$ or $m = \infty$. For $m = 2$ suppose further that the assumptions of Theorem
7 hold. For $w \in A^*$, $\ell = 1,\dots,L$ set

$$\mathrm{cf}^{\sigma}_\ell(w) := \alpha_\ell(w)\,\mathrm{cf}^{m}_\ell(w)$$

where $\mathrm{cf}^{m}_\ell(w)$ is chosen as in Theorem 6 if $m = \infty$ and as in Theorem 7 if
$m = 2$. Then, setting the confidence radius to be $\mathrm{cf}_\ell(w) = \min\{\mathrm{cf}^{\sigma}_\ell(w), 1\}$,
we have that with probability at least $1-\delta$ the event $\mathrm{Good}_m$ occurs.

By construction, it follows that $\mathrm{cf}^{\sigma}_\ell(w) \le \mathrm{cf}^{m}_\ell(w)$ since $\alpha_\ell(w) \le 1$. However,
$\mathrm{cf}^{\sigma}_\ell(w)$ might not be non-increasing in $w$. Nonetheless, the confidence radius
$\mathrm{cf}^{\sigma}$ can be majorized by a monotone confidence radius which still leads
to an improvement over $\mathrm{cf}^{m}$, namely

d}{w) = maxcf£(tf') < maxcf^(w') = cf|(tf). 

A side remark is that $\mathrm{cf}^{\sigma}_\ell(w)$ requires the estimation of $\sigma_\ell(w)$. It follows that
any such estimator will satisfy $\hat\sigma_\ell(w) \le 1/2$, still achieving a smaller confidence
radius than $\mathrm{cf}^{m}$. However, the estimates need to satisfy $\sigma_\ell(w) \le \hat\sigma_\ell(w)$ with
high probability.

5. Methodological applications. In this section we develop two
applications of the approximate group context tree model and estimation
algorithms. In both cases the main object of interest is not the context tree
nor the conditional probabilities but functionals of these quantities. In what
follows we estimate these functionals based on $\hat T_n$ and $\hat P_n$, accounting for the
estimation error and possible misspecification. These two applications rely
on different metrics and penalty functions, providing a motivation for the
generality of the previous analysis.

5.1. Discrete stochastic dynamic programming. Discrete stochastic
dynamic programming focuses on solving structured optimization problems
in which a control $u$ is chosen from a set of discrete options $U$ at time $t$
and yields some instantaneous payoff $f(a,u)$ where $a \in A$ is the current
state. The system evolves to a state $x_{t+1}$ at period $t+1$ according to an
$A$-valued random function $s(x_{-\infty}^{t}, u)$ whose transition probabilities depend
on the chosen control $u \in U$ and (potentially) the complete history of states
$x_{-\infty}^{t} \in A_{-\infty}^{t}$. In applications, the main object of interest is the value
function that characterizes the expected future payoffs as a function of the history
of states:

$$V(x) = \max_{u \in U}\bigl\{ f(x_{-1}, u) + \beta\,E[V(x\,s(x,u))] \bigr\}$$

where $\beta < 1$ is the discount factor and $x\,s(x,u)$ is the concatenation of $x$
with $s(x,u)$.

The success of this approach relies on avoiding the use of the complete state
space $A_{-\infty}^{-1}$. The selected state space needs to be rich enough to capture
the main features of the transition function $s(\cdot,\cdot)$. However, in practice the
transition probabilities between states need to be estimated, and algorithms
to compute the value function can suffer from a curse of dimensionality if
the state space is large.

Thus, our motivation to apply the methods described in the previous
section is twofold. First, to create estimates for the transition probabilities.
Second, to create a data-driven manageable state space. This is exactly the
case in which using the AGCT model can be more attractive than using
a substantially larger compatible tree $T^*$. We advocate in favor of a small
approximation error (that is comparable with the noise in the estimation)
with a substantially smaller state space. Thus, for $x \in A_{-\infty}^{-1}$, we propose to
estimate the value function with

$$\hat V(x) = \hat V(\hat T_n(x)),$$

and the transition probabilities with $\hat P_{n,u}(\cdot \mid \hat T_n(x)) = \hat P_{n,u}(\cdot \mid x)$, which are
allowed to depend on the action $u \in U$. Thus the total number of states of
the system (suffixes) is the number of leaves of $\hat T_n$.
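To make the reduced state space concrete, the following Python sketch runs plain value iteration on the leaves of a small context tree. It is only an illustration under stated assumptions: histories are strings with the most recent symbol last, the tree is assumed to be closed under appending one symbol (so that the next leaf is determined by the current leaf and the new symbol), and the names `context_of`, `value_iteration`, `prob` and `payoff` are hypothetical. Value iteration is a standard solver swapped in here; the text does not prescribe a particular algorithm.

```python
from typing import Callable, Dict, List


def context_of(history: str, leaves: List[str]) -> str:
    """Return the unique leaf that is a suffix of `history` (latest symbol last)."""
    matches = [w for w in leaves if history.endswith(w)]
    if len(matches) != 1:
        raise ValueError(f"history {history!r} does not determine a unique leaf")
    return matches[0]


def value_iteration(
    leaves: List[str],
    alphabet: List[str],
    controls: List[str],
    prob: Dict[str, Dict[str, Dict[str, float]]],  # prob[u][w][a] = P_u(a | w)
    payoff: Callable[[str, str], float],           # payoff(last_symbol, u)
    beta: float,
    tol: float = 1e-10,
    max_iter: int = 10_000,
) -> Dict[str, float]:
    """Iterate V(w) = max_u { f(w[-1], u) + beta * sum_a P_u(a|w) V(leaf of w+a) }."""
    values = {w: 0.0 for w in leaves}
    for _ in range(max_iter):
        updated = {}
        for w in leaves:
            candidates = []
            for u in controls:
                continuation = sum(
                    prob[u][w][a] * values[context_of(w + a, leaves)]
                    for a in alphabet
                )
                candidates.append(payoff(w[-1], u) + beta * continuation)
            updated[w] = max(candidates)
        gap = max(abs(updated[w] - values[w]) for w in leaves)
        values = updated
        if gap < tol:  # successive iterates agree, stop early
            break
    return values
```

With the three leaves `["1", "10", "00"]` the solver works with three states instead of all binary histories, which is the kind of saving described above.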

The following result derives guarantees on the estimation error between the
value function $V$ and its estimator $\hat V$ that solves the fixed point problem
defined with the estimates of the conditional probabilities $\hat P_{n,u}(\cdot \mid x)$.

Lemma 1. In the discrete stochastic dynamic programming problem
described above, for $q \ge 1$ the estimator $\hat V$ satisfies

$$\max_{x \in A_{-\infty}^{-1}} \frac{|V(x) - \hat V(x)|}{\|V(x\,\cdot)\|_{q/(q-1)}\,\max_{u \in U}\|\hat P_{n,u}(\cdot|x) - p_u(\cdot|x)\|_q} \le \frac{\beta}{1-\beta}$$

where $q/(q-1) = \infty$ if $q = 1$.

The lemma above allows us to apply the results on the estimation of the
conditional probabilities to the stochastic dynamic programming problem.
Let the number of groups $L = |U|$, $d_{\mathcal{S}} = \|\cdot\|_1/2$ and $k = r = \infty$.

Theorem 9. In the discrete stochastic dynamic programming problem
described above, by choosing cf as in Theorem 6, we have that with probability
at least $1-\delta$ the estimator $\hat V$ of the value function satisfies

$$\max_{x \in A_{-\infty}^{-1}} \frac{|\hat V(x) - V(x)|}{\|\mathrm{cf}(T(x))\|_{L,\infty}\,\|V(x\,\cdot)\|_\infty} \le \frac{2\beta}{1-\beta}\Bigl(1 + 2c + \frac{2c}{c-1}\sup_{x \in A_{-\infty}^{-1}} \frac{c_{T(x)}}{\|\mathrm{cf}(T(x))\|_{L,\infty}}\Bigr),$$

where $\|\mathrm{cf}(T(x))\|_{L,\infty} \lesssim \max_{\ell=1,\dots,L} \sqrt{\dfrac{\log(nL/\delta) + \log\log N_{n-1,\ell}(T(x)) + |A|}{N_{n-1,\ell}(T(x))}}$.

Recall that under mild conditions the oracle balances bias and variance
so that there is a constant $K$ such that

$$\sup_{x \in A_{-\infty}^{-1}} \frac{c_{T(x)}}{\|\mathrm{cf}(T(x))\|_{L,\infty}} \le K.$$

In this case, the rate of convergence is governed by the oracle regularization
term $\|\mathrm{cf}(T(x))\|_{L,\infty}$, which goes to zero as the sample size increases.

5.2. Dynamic discrete choice models. In dynamic discrete choice models
a group of agents makes choices among the same set of options over time
[1, 2, 4, 5, 9]. Typically, models pre-specify the Markovian structure of the
process, which is commonly assumed to be a Markov chain of order 1. We
are interested in relaxing this assumption and estimating the relevant
context tree and the associated transition probabilities. Agents are assumed to
be sampled independently from the same population. We assume that the
underlying context tree is the same across agents, but allow for the specific
transition probability to vary by agent to account for heterogeneity. Herein
we focus on the case with no covariates, but results can be extended to the
case of discrete covariates [4, 5].

In applications, the main interest is in statistics that are functions of
the conditional probabilities rather than the conditional probabilities
themselves. Here we focus on the average marginal dynamic effect for $a \in A$,

$$\mathrm{AVE}_m(a,x,y) = E[m_\ell(a,x,y)]$$

where the marginal dynamic effect is $m_\ell(a,x,y) = p_\ell(a|x) - p_\ell(a|y)$, and the
expectation is taken over the distribution of agents in the population of
interest. The average marginal dynamic effect measures the average over
the population of the change in the probability of selection of an option
$a \in A$ between two different histories of past consumption $x, y \in A_{-\infty}^{-1}$.
Other measures of interest in the literature are the long run proportions of
a particular option being chosen, or the probability of selecting a particular
option $t$ periods ahead given the current state, see [4].

The estimator of the marginal dynamic effect for an option $a \in A$ and
histories of consumption $x, y \in A_{-\infty}^{-1}$ for the $\ell$th agent is

$$\hat m_\ell(a,x,y) = \hat P_{n,\ell}(a \mid \hat T_n(x)) - \hat P_{n,\ell}(a \mid \hat T_n(y)),$$

and the estimator for the average marginal dynamic effect is

$$\widehat{\mathrm{AVE}}_m(a,x,y) = \frac{1}{L}\sum_{\ell=1}^{L} \hat m_\ell(a,x,y).$$

This motivates the choice of $d_{\mathcal{S}} = \|\cdot\|_\infty$, $k = 1$, and $r = m = 2$ in the AGCT
model.

Theorem 10. In the dynamic discrete choice model described above,
if the context tree and conditional probabilities are estimated with cf as in
Theorem 7, we have that with probability at least $1 - 2\delta$ the estimator for
the average marginal dynamic effect satisfies

|AVEm(a, a;, w) - AVEm(a, x, y)\ 4c^ 
max ^ \_j_^yj_ ^ ' '^^' < . 



where \\d{T{z))\\^^, < Ji2ii£IILH££^ J^ti l/N^-iATiz)), z G AZ 



oo ■ 



This uniform rate of convergence for the average marginal dynamic effect
is governed by the rate of convergence of the conditional probabilities of
the oracle estimator, and the number of different agents in the data. As
mentioned in the previous section, in many models of interest there is a
constant $K$ such that

$$\sup_{z \in A_{-\infty}^{-1}} \frac{c_{T(z)}}{\|\mathrm{cf}(T(z))\|_{L,2}} \le K$$

so that the rate of convergence of the conditional probabilities is governed
by the regularization terms and $\sqrt{\log n/L}$. Interestingly, the above result
holds uniformly over all pairs $x, y \in A_{-\infty}^{-1}$. The cost to attain this uniform
rate, which accounts for the model selection and the size of $\hat T_n$, is the $\sqrt{\log n}$ factor.
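The plug-in estimator of the average marginal dynamic effect from Section 5.2 is simple to compute once a common tree and agent-specific probabilities are available. The Python sketch below is a minimal illustration under assumptions of our own: histories are strings with the most recent choice last, the estimated tree is given as a list of leaves, and `agent_probs[l][w][a]` stores the estimate of the probability that agent `l` chooses `a` after context `w`. All names are hypothetical.

```python
from typing import Dict, List

Probs = Dict[str, Dict[str, float]]  # context -> option -> estimated probability


def context_of(history: str, leaves: List[str]) -> str:
    """Return the unique leaf that is a suffix of `history` (latest choice last)."""
    matches = [w for w in leaves if history.endswith(w)]
    if len(matches) != 1:
        raise ValueError(f"history {history!r} does not determine a unique leaf")
    return matches[0]


def average_marginal_effect(
    agent_probs: List[Probs], leaves: List[str], a: str, x: str, y: str
) -> float:
    """Average over agents of P_l(a | context of x) - P_l(a | context of y)."""
    wx = context_of(x, leaves)
    wy = context_of(y, leaves)
    effects = [p[wx][a] - p[wy][a] for p in agent_probs]
    return sum(effects) / len(effects)
```

Only the two contexts reached by $x$ and $y$ enter the computation, so two long histories that share their contexts give identical estimates.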

6. Bounds under primitive conditions. In this section we discuss the
finite-sample behavior of the AGCT estimator for families of processes
satisfying certain conditions. One key ingredient will be to derive finite-sample
bounds on the behavior of the confidence radius under appropriate mixing
conditions. Our basic setup is summarized by the following assumption.

Assumption 1 (Basic Setup). $\{X(\ell)\}_{\ell=1}^{L}$ are stationary stochastic
processes with values in a finite alphabet $A$. We are given $n \in \mathbb{N}$ with $n \ge 9$,
$L \in \mathbb{N}$ is the number of groups and $\delta \in (0,1)$ is a confidence parameter. We
write $L = n^{\alpha_L}$, $\delta = n^{-\alpha_\delta}$ and

a = ai, + as > as < 6 < 1. 

We let $(\hat T_n, \hat P_n)$ denote the output of the AGCT estimator on input $\{X(\ell)_1^n\}_{\ell=1}^{L}$
with confidence level $1-\delta$, a metric $d = d_{\mathcal{S}}$, slack parameter $c > 1$ with the
confidence interval described in Section 4.

In what follows, we will need additional notation in which $C_1, C_2, \dots > 0$
are universal constants (i.e., they do not depend on the processes, on their
mixing rates, or on the parameters $n$, $L$, $\delta$). Given $w \in A^*$ and $k \ge 1$, define

$$\bar c_w := \sup\bigl\{\|d(p(\cdot \mid x), p(\cdot \mid y))\|_{L,k} : x_{-|w|}^{-1}(\ell) = y_{-|w|}^{-1}(\ell) = w,\ 1 \le \ell \le L\bigr\}.$$

The approximation error $\bar c_w$ is more primitive than $c_w$, which depends on the
particular sample. This new quantity will allow us to make statements based
only on the properties of the processes. Nonetheless these approximation
errors are closely related and they satisfy $c_w \le \bar c_w \le 2c_w$ for all $w \in A^*$.
Hence, for the purpose of rates of convergence they are interchangeable. We
also define steady state probabilities

$$\pi_\ell(w) := P\bigl(X(\ell)_{-|w|}^{-1} = w\bigr), \quad 1 \le \ell \le L.$$

For a complete finite tree $T \subset A^*$, let $h_T$ denote the height of $T$ and

$$\pi_T := \min_{1 \le \ell \le L} \min\{\pi_\ell(w) : w \text{ leaf of } T,\ \pi_\ell(w) > 0\} > 0.$$

In order to derive primitive rates of convergence, we will control the
data-driven confidence radius $\mathrm{cf}(w)$ which determines the size of the regularization
term. We define non-empirical versions of the confidence radii, $\overline{\mathrm{cf}}_\ell(w)$, by
replacing $N_{n-1,\ell}(w)$ with $n\pi_\ell(w)$ in the definition of $\mathrm{cf}_\ell(w)$. Given a complete
finite tree $T$ and $r \ge 1$, the following typicality event plays a central role:

(6.1) $$\mathrm{Typ}_r(T) := \Bigl\{ \sup_{x \in A_{-\infty}^{-1}} \Bigl| \frac{\|\mathrm{cf}(T(x))\|_{L,r}}{\|\overline{\mathrm{cf}}(T(x))\|_{L,r}} - 1 \Bigr| \le \frac{C_2}{\log n} \Bigr\}.$$

6.1. General results under β-mixing conditions. We recall the definition
of β-mixing.

Definition 1. A process $X_{-\infty}^{+\infty}$ with values in a finite alphabet $A$ is said
to be β-mixing (or absolutely regular) if there exists a function $\beta : \mathbb{N} \to [0,1]$
with $\lim_{b \in \mathbb{N},\, b \to \infty} \beta(b) = 0$ and, for all $k \in \mathbb{Z}$, $s \in \mathbb{N}$:

$$\beta(b) \ge E\Bigl[\sup_{E \subset A^s} \Bigl| P\bigl(X_{k+b}^{k+b+s-1} \in E \mid X_{-\infty}^{k}\bigr) - P\bigl(X_{k+b}^{k+b+s-1} \in E\bigr) \Bigr|\Bigr].$$

The function $\beta(\cdot)$ is called a (β-)mixing rate function for $X_{-\infty}^{+\infty}$.

We will present general results under the assumption of polynomial
β-mixing.

Assumption 2 (Beta Mixing). Assumption 1 holds and there are
constants $\Gamma > 0$ and $\gamma > 0$ such that the processes $X(1), \dots, X(L)$ are β-mixing
with common rate function:

$$\beta(b) = \Gamma b^{-\gamma} \quad (b \in \mathbb{N}).$$

Remark 8. Most of the literature on context-tree-based estimation
assumes φ-mixing or stronger conditions. φ-mixing consists of replacing the
expectation in Definition 1 by an essential supremum, and can be shown for
finite-order Markov chains and other examples. However, there are other
natural examples (such as renewal processes) that are β-mixing but not
φ-mixing. Ferrari and Wyner [15] make the alternative assumption of
geometric α-mixing. We will provide a more detailed comparison after stating our
results below; for now we only note that this assumption, as well as some of
their other assumptions, are incomparable with ours.

For $\delta_0 \in (0,1)$ define the set of complete trees whose leaves are neither
too rare nor too long relative to the β-mixing condition and the sample size
$n$:
(6.2) 

f f r A* Tr-i-^+'^) < Uh ni-'i. \ 

Ts = < J ^^ . "r - [Ci(l+a)(l+{6r)i/7)]7 log3^+i„' I 

" 1^ complete finite tree hf, + I < iTf n/[Ci (1 + a) log^ n] J 

The following result establishes that the typicality event (6.1) is likely to occur
for any tree in $\mathcal{T}_{\delta_0}$.

Theorem 11. If Assumption 2 holds then

$$\min_{T \in \mathcal{T}_{\delta_0}} \min_{r \ge 1} P\bigl(\mathrm{Typ}_r(T)\bigr) \ge 1 - \delta_0.$$

Among the complete trees in $\mathcal{T}_{\delta_0}$, let $T^*_{\delta_0}$ achieve the minimum of the
following (restricted) oracle problem

$$\min_{T \in \mathcal{T}_{\delta_0}} \sup_{x \in A_{-\infty}^{-1}} \Bigl( c_{T(x)} + \max\Bigl\{ \frac{c+1}{c-1}\,c_{T(x)},\ (1+2c)\,\|\mathrm{cf}(T(x))\|_{L,r} \Bigr\} \Bigr).$$

Note that the criterion above is exactly the bound in the oracle
inequality developed in Theorem 5. Thus, $T^*_{\delta_0}$ is a context tree that yields good
conditional probability estimates within trees for which we can ensure that
the typicality event (6.1) is likely to hold by Theorem 11. That directly
implies primitive results about the estimates $(\hat T_n, \hat P_n)$ provided by the pruning
algorithm.

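Theorem 12. Suppose Condition AGCT and Assumption 2 hold, and
let $\delta_0 \in (0,1)$ be such that $\mathcal{T}_{\delta_0} \ne \emptyset$. Then assuming that $\mathrm{Typ}_r(T^*_{\delta_0})$ and
$\mathrm{Good}_m$ both occur, we have:

$$\sup_{x \in A_{-\infty}^{-1}} \frac{\|d(p(\cdot|x), \hat P_n(\cdot|x))\|_{L,k}}{c_{T^*_{\delta_0}(x)} + \max\bigl\{\frac{c+1}{c-1}\,c_{T^*_{\delta_0}(x)},\ (1+2c)\,\|\mathrm{cf}(T^*_{\delta_0}(x))\|_{L,r}\bigr\}} \le 1 + \frac{C_2}{\log n}.$$

In particular, the event above occurs with probability at least $1 - \delta - \delta_0$.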



Next we provide model selection results. For trees in $\mathcal{T}_{\delta_0}$, we show that
subtrees that have well-separated paths must be selected by the estimator
$\hat T_n$ (Theorem 13) while trees with a small misspecification must contain the
estimator $\hat T_n$ (Theorem 14).

Theorem 13. Suppose Condition AGCT and Assumption 2 hold, and
let T G 7^0 • Assume that T- (^ T is a subtree consisting of all nodes w £ T 
such that there exist x € ^loc V ^ ^-00 ^^^^ ^(^) ^ ^; ^iv) ^ par(tf) 
and: 



Mp{-\x),p{-\y))\\ 



L,fc 



> 



(1 + 1^)0 



+^T{x) + '^T{y) 



cf(T(x)) 



cf(T(2/)) 



L,r 



Then if $\mathrm{Good}_m$ and $\mathrm{Typ}_r(T)$ both occur, we have that $T_- \subset \hat T_n$. In
particular, $P(T_- \subset \hat T_n) \ge 1 - \delta - \delta_0$.



Theorem 14. Suppose Condition AGCT and Assumption 2 hold. Let
$T_+ \in \mathcal{T}_{\delta_0}$ be such that for all $x \in A_{-\infty}^{-1}$



c"f+(.) < 2 ( 1 



C2 
logn 



(c-1) cf(T+(x)) 



L,r 



Then if $\mathrm{Good}_m$ and $\mathrm{Typ}_r(T_+)$ both occur, we have $\hat T_n \subset T_+$. In particular,
$P(\hat T_n \subset T_+) \ge 1 - \delta - \delta_0$.



We briefly indicate similarities and differences between the results
presented above and those of [15], which concerns the single-process case. The work
[15] proves weak consistency in the estimation of conditional probabilities
and of (truncated) context trees for all nodes in a tree $T_n$ that grows with
the sample size $n$. For this they assume that the stochastic process is
geometrically α-mixing, and also that there is sufficient separation between the
conditional probabilities corresponding to leaves of the tree and their
parents. The authors point out that their assumptions might be hard to check
in practice.

Our analysis differs from theirs in several important aspects. Our goal is
to estimate transition probabilities given the entire infinite past, uniformly
over all such pasts. Achieving consistency in our setting requires that these
probabilities be continuous functions of the infinite past, which [15] do not
need to assume. By contrast, given continuity and β-mixing, model selection
and probability estimation become separate tasks. In particular, our results
on the transition probabilities do not require any kind of separation between
leaves and their parents. In addition, our results cover natural and
interesting classes of processes (such as certain renewal processes) where geometric
mixing of any kind is not available, and also the use of multiple processes.
Other points of the analysis are mostly incomparable due to the differences
in assumptions.

6.2. Application to the parametric case. In this section we consider the
behavior of our estimator in the setting where the processes $X(1), \dots, X(L)$
have a compatible context tree that is finite.

Assumption 3 (Parametric assumption). The processes $X(1), \dots, X(L)$
are stationary and ergodic. Moreover, there exists a complete finite tree $T^*$
and transition probabilities $p = (p_1, \dots, p_L)$ that are compatible with $T^*$ such
that:

$$\forall\, 1 \le \ell \le L,\ \forall a \in A:\quad P\bigl(X(\ell)_0 = a \mid X(\ell)_{-\infty}^{-1}\bigr) = p_\ell\bigl(a \mid T^*(X(\ell)_{-\infty}^{-1})\bigr) \quad \text{a.s.}$$

Moreover, each of these processes is stationary β-mixing with the same
exponential rate function:

$$\beta(b) = \chi e^{-\nu b} \quad \text{where } \chi, \nu > 0.$$

We recall that any ergodic finite-order Markov chain is exponentially
φ-mixing, which is stronger than exponential β-mixing. Our assumption
requires that the exponential β-mixing rates of the $L$ processes be uniformly
controlled.

Theorem 15 (Rates under the Parametric Assumption). Suppose
Condition AGCT and Assumptions 1 and 3 hold simultaneously and 6% < n""*""^.
Then there exists a constant $C_3 > 0$ depending only on $|\mathcal{S}|$ such that if:



(6.3) 
then 



^ , -. I a + l\ hr* + 1 

Cg (1 + a) 1 + —— I < 



n 



TTx* 



log n 



d(p(.|x),P„(-|x)) 

P I sup „_ 

',e^-^(l + 2c)||cf(T*(x))||L 



L,fc 



< 1 



_C2 

loeri. 



> 1-S-Sn 



If in addition 



imuiax<.\\d{p{-\w),p{-\w'))\\ru ■ w,w' leaves of T*, w'ypai{w)> >0 

w w' (. ' J 



then the extra condition 
2(1+ ^' 



W„/(1 + C) ™a^i Ff(T*(x))||L,, <4,: 



implies that 

( 

{Xn=T*}C^{ sup 



d{p{-\x),Pn{-\x)) 



L.fc 



V 



2;eA_ 



(l + 2c) cf(T*(x)) 



< 1 



L.r 



logn^ 



> 1-5 -6a- 



To discuss the results in Theorem 15 we focus on the case that $\alpha$ and $c$
are constants but $\pi_{T^*}, d_{k,\min} \ll 1$. Theorem 15 establishes that in the choice
of cf for $\mathrm{Good}_\infty$, we need



n = 



1 



TTT* dl _;, 



■log 



1 



fc,min 



T^T* dl „;, 



fc,min . 



TTt* 



V <: -^ log^ 



Ht* 
TTt* 



in order to correctly recover $T^*$ with high probability. Moreover, the uniform
rate of convergence of the estimation error of the conditional probabilities
is bounded by $\sqrt{\log n/(\pi_{T^*} n)}$. By contrast, provided the number of groups $L$
is sufficiently large, in the case cf is chosen to achieve $\mathrm{Good}_2$ one needs



n = 



1 



TTT* dl 



log log 



1 



TTT* dl^:^ 



y<-^\og''^^' 



Ht* 



^A:,min \ " ^ ' "'fc,min/ J <^ " ^ ' \'^T* , 

in order to recover $T^*$, and the uniform rate of convergence of the
estimation error is bounded by $\sqrt{\log\log n/(\pi_{T^*} n)}$. This indicates the advantage of
choosing cf to achieve the $\mathrm{Good}_2$ criterion when the number $L$ of independent
processes is sufficiently large.

It is helpful to compare these results with the results on model selection
for the parametric single-process case ($L = 1$) obtained in [8]. They
considered models with compatible trees $T^*$ of size increasing with the
sample size $n$ under an assumption that implies geometric φ-mixing. They show
that $T^*$ is correctly recovered with high probability if



of V^^^ \ ^^^^^i 



n 



W1/2+-- / ---T, 




where $a, b > 0$ are constants and $d_{T^*}$ is the minimum separation between
the conditional probabilities corresponding to a leaf and its parent. The
latter quantity is at most the maximum separation between leaves, which is
our $d_{k,\min}$. Moreover, they also assume that no leaves of $T^*$ have null
probability, so that $T^*$ has at most $\pi_{T^*}^{-1}$ leaves. This in turn implies that $T^*$
has height $h_{T^*} \le \pi_{T^*}^{-1}$. Thus conditions in [8] imply:



hT* 




f n \ 


1 




( n 




- (> 1 




and ^5 


- () 




TT-T* 


1 


Klog^+^^n) 


'^T* d^ -^ 




Vlogi+^n 



Those conditions match our own conditions up to the logarithmic terms. 
Thus the result presented here nearly implies the perfect model selection 
result in [8], even with our pessimistic estimates on /ij^* and c^fc.miii) and offers 
improvements in situations where hx^ji^T^ <C 1/ log^ n and/or dfc^min ^ <^T;t ■ 
Additionally, we also obtain improvements in the group setting which was 
not considered in the context tree literature before. 

6.3. Chains of infinite order with complete connections. We now consider
processes which do not satisfy the parametric assumption, but which can be
well-approximated by order-$k$ Markov chains. We focus on a particularly
well-understood case where $X(1), \dots, X(L)$ are known to be φ-mixing [10].

We first need some definitions. Let:

$$q_{\min} = \min_{1 \le \ell \le L,\ a \in A,\ x \in A_{-\infty}^{-1}} p_\ell(a|x).$$

Also define the continuity rates:

$$\chi_\ell(b) = 1 - \min_{w \in A^b} \sum_{a \in A} \inf_{x:\, x_{-b}^{-1} = w} p_\ell(a|x) \quad (b \in \mathbb{N}\setminus\{0\},\ 1 \le \ell \le L).$$

Assumption 4 (Assumption of chains with complete connections).
Assumption 1 holds, and assume $q_{\min} > 0$ and for some $\gamma > 2\alpha_L + 1$

$$\chi_\ell(b) \le \Gamma_0\, b^{-1-\gamma} \quad (b \in \mathbb{N}).$$

To see how the assumption relates to group context tree models, consider
the "canonical approximation" of $X(\ell)$ by an order-$b$ Markov chain, with
transition probabilities:

$$p_\ell(a \mid T_b(x)) = P\bigl(X(\ell)_0 = a \mid X(\ell)_{-b}^{-1} = w\bigr), \quad w = T_b(x),$$

where $T_b$ denotes the tree that contains all suffixes of length at most $b$.

One can check that the distance $d_{\mathcal{S}}$ between $p_\ell(\cdot \mid T_b(x))$ and $p_\ell(\cdot \mid x)$ is at
most $\chi_\ell(b)$. It follows that the faster $\chi_\ell(b)$ decays, the better $X(\ell)$ can be
approximated by Markov chains of finite order.
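In practice the canonical order-$b$ transition probabilities are unknown and would be replaced by relative frequencies. The Python sketch below computes those frequencies from one observed path; it is a plain empirical counterpart offered for illustration, not the AGCT estimator itself, and the function name `order_b_frequencies` is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict


def order_b_frequencies(sample: str, b: int) -> Dict[str, Dict[str, float]]:
    """Relative frequency of each next symbol after every observed block of length b."""
    if b < 1 or b >= len(sample):
        raise ValueError("need 1 <= b < len(sample)")
    counts: Dict[str, Counter] = defaultdict(Counter)
    for t in range(b, len(sample)):
        # sample[t - b:t] is the length-b past, sample[t] the symbol that follows it
        counts[sample[t - b:t]][sample[t]] += 1
    return {
        w: {a: c / sum(nxt.values()) for a, c in nxt.items()}
        for w, nxt in counts.items()
    }
```

Increasing `b` reduces the approximation error measured by the continuity rate but multiplies the number of contexts, each of which is then observed less often.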

Theorem 16 (Rates for Chains with Infinite Connections). Suppose
Condition AGCT and Assumption 4 hold. There exist constants $C_4, C_5 > 0$,
depending only on $c$, $|\mathcal{S}|$, $q_{\min}$, $\Gamma_0$ and $\gamma$ such that if

(6.4) ^l^>c,{l + a)\ 



log^"" n 



then 



sup 



d{Pn{-\x),pi-\x)) 



< --j — I > 1 — — n 



For example, this probability is 1—0 {n^^) if ckl ^ 1 (i-6-, L < n), 6 = n""^ 
and 7 > 7. In contrast to the previous section, the difference between the 
$\ell_\infty$ and $\ell_2$ cases is not apparent in the simplified bound presented above.
We also note that, with high probability, the estimated tree only contains
strings of size $O(\log n)$: $q_{\min} > 0$ implies that any string $w$ has stationary
probability at most $(1 - q_{\min})^{|w|}$, so no strings of length much larger than $\log n$
will ever be seen in the sample. No statement about "lower bounds" for $\hat T_n$
in the spirit of Theorem 13 can be made at this level of generality.
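A short calculation makes the $O(\log n)$ claim concrete. It is a sketch under the additional assumption $|A| \ge 2$, and the threshold below is one convenient choice rather than a constant taken from the paper. Since every symbol has conditional probability at most $1 - q_{\min}$, a fixed string $w$ satisfies $\pi_\ell(w) \le (1 - q_{\min})^{|w|}$. Each of the $L$ samples offers at most $n$ positions, so the expected number of occurrences of $w$ is at most $nL(1 - q_{\min})^{|w|}$, and by Markov's inequality

$$P\bigl(w \text{ occurs in some sample}\bigr) \le nL\,(1 - q_{\min})^{|w|} \le \frac{1}{n} \quad \text{whenever} \quad |w| \ge \frac{\log(n^2 L)}{\log\bigl(1/(1 - q_{\min})\bigr)}.$$

With $L = n^{\alpha_L}$ the threshold equals $(2 + \alpha_L)\log n / \log(1/(1 - q_{\min}))$, which is $O(\log n)$ for fixed $\alpha_L$ and $q_{\min}$.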

7. Simulations. In this section we conduct Monte Carlo experiments
to assess the finite-sample performance of the proposed estimator. We use
two different designs for the true context tree $T^*$ in these experiments: (i)
a full binary Markov chain of order 3, and (ii) a sparse binary VLMC with
infinite length associated with a renewal process. The associated context
trees are displayed in Table 1. The former model corresponds to a parametric
model with well separated conditional probabilities. The latter corresponds
to an infinite context tree induced by a renewal process. These designs are
two extreme cases to help illustrate the performance of the estimator on
balanced and unbalanced context trees. For each design we consider the size
of the group to be $L = 1, 10, 100$, various sample sizes $n$, and two different
choices of the regularization parameter, $c = 1.01, 0.5$. In all simulations we
used $d_{\mathcal{S}} = \|\cdot\|_\infty$, $k = 1$, $r = 2$, $m = 2$, and set the confidence level with
$1 - \delta = 0.95$.

Table 1
[Context tree diagrams not recovered. Leaf distributions legible for the order-3 tree, in reading order: (3/4, 1/4), (1/2, 1/2), (1/2, 1/2), (1/4, 3/4), (3/4, 1/4), (1/2, 1/2), (1/2, 1/2), (1/4, 3/4).]
The context trees above illustrate the two models used in our simulations. The left
context tree corresponds to a full binary Markov chain of order 3. The right context tree
corresponds to a process with infinite memory associated with a renewal process.

Table 2 displays the model selection performance of the proposed
algorithm for the full binary Markov chain of order 3 when the parameter $c$ is set
to 1.01 and 0.5. In the case of $c = 1.01$, which follows the theoretical
recommendation of the previous section, in every instance the estimated tree $\hat T_n$
was contained in the true context tree $T^*$, confirming our theoretical results.
Moreover, the estimated context tree contained a full binary Markov chain
of order 2 in most instances. In the larger sample size with 100 groups, we
achieved perfect recovery of the model. When we set $c = 0.5$ additional nodes
not in $T^*$ are occasionally included (the average number of extra nodes is
displayed in the last row of the table labeled as "extra"). If multiple groups
are used in the estimation, the number of extra nodes selected was smaller.



The renewal process is defined by the independent times $\tau_j$ between
observing two 1's. We specified the random variable $\tau_j$ such that $P(\tau_j = k) =
1/[(2\log 2 - 1)\,k\,(4k^2 - 1)]$. Stationarity requires that the first time be drawn
from a different distribution, see [13, 17] for details. We note that the
polynomial decay of the tail suggests potentially long estimated trees. Table 3
displays the model selection performance of the proposed algorithm for the
renewal process when the parameter $c$ is set to 1.01 and 0.5. In the case of
$c = 1.01$, which follows the theoretical recommendation of the previous section,



Probability of selection with parameter c = 1.01

        n=1000                 n=2500                 n=5000
Node    L=1    L=10   L=100    L=1    L=10   L=100    L=1    L=10
000     0.0    0.0    0.0      0.02   0.48   1.0      0.68   1.0
100     0.0    0.0    0.0      0.02   0.48   1.0      0.68   1.0
010     0.0    0.0    0.0      0.03   0.17   0.83     0.57   1.0
110     0.0    0.0    0.0      0.03   0.17   0.83     0.57   1.0
001     0.0    0.0    0.0      0.03   0.11   0.09     0.53   1.0
101     0.0    0.0    0.0      0.03   0.11   0.09     0.53   1.0
011     0.0    0.0    0.0      0.04   0.42   1.0      0.69   1.0
111     0.0    0.0    0.0      0.04   0.42   1.0      0.69   1.0
00      0.36   1.0    1.0      1.0    1.0    1.0      1.0    1.0
10      0.36   1.0    1.0      1.0    1.0    1.0      1.0    1.0
01      0.4    0.98   1.0      1.0    1.0    1.0      1.0    1.0
11      0.4    0.98   1.0      1.0    1.0    1.0      1.0    1.0
0       0.66   1.0    1.0      1.0    1.0    1.0      1.0    1.0
1       0.66   1.0    1.0      1.0    1.0    1.0      1.0    1.0
root    1.0    1.0    1.0      1.0    1.0    1.0      1.0    1.0
extra   0.0    0.0    0.0      0.0    0.0    0.0      0.0    0.0

Probability of selection with parameter c = 0.5

        n=1000                 n=2500                 n=5000
Node    L=1    L=10   L=100    L=1    L=10   L=100    L=1    L=10
000     0.92   1.0    1.0      1.0    1.0    1.0      1.0    1.0
100     0.92   1.0    1.0      1.0    1.0    1.0      1.0    1.0
010     0.83   1.0    1.0      1.0    1.0    1.0      1.0    1.0
110     0.83   1.0    1.0      1.0    1.0    1.0      1.0    1.0
001     0.84   1.0    1.0      1.0    1.0    1.0      1.0    1.0
101     0.84   1.0    1.0      1.0    1.0    1.0      1.0    1.0
011     0.91   1.0    1.0      1.0    1.0    1.0      1.0    1.0
111     0.91   1.0    1.0      1.0    1.0    1.0      1.0    1.0
00      1.0    1.0    1.0      1.0    1.0    1.0      1.0    1.0
10      1.0    1.0    1.0      1.0    1.0    1.0      1.0    1.0
01      1.0    1.0    1.0      1.0    1.0    1.0      1.0    1.0
11      1.0    1.0    1.0      1.0    1.0    1.0      1.0    1.0
0       1.0    1.0    1.0      1.0    1.0    1.0      1.0    1.0
1       1.0    1.0    1.0      1.0    1.0    1.0      1.0    1.0
root    1.0    1.0    1.0      1.0    1.0    1.0      1.0    1.0
extra   4.22   0.0    0.0      8.67   0.25   0.0      27.55  11.02

Table 2
The table illustrates the model selection performance for selecting nodes of the true
context tree in the full binary Markov chain of order 3.



in every instance the estimated tree $\hat T_n$ was contained in the true context
tree $T^*$, confirming our theoretical results. As expected, as the sample size
increases the estimated context tree also grows, chasing the infinite true
context tree. When we set $c = 0.5$ additional nodes not in $T^*$ are
occasionally included (the average number of extra nodes is displayed in the last row
of the table labeled as "extra"). Nonetheless, when multiple groups are used
in the estimation, no node outside of the true context tree is selected.
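For readers who want to reproduce the renewal design, the Python sketch below draws inter-arrival times from $P(\tau = k) = 1/[(2\log 2 - 1)\,k\,(4k^2 - 1)]$ by sequential inversion and converts them to a binary path. It is a simplified illustration: the first gap is drawn from the same law, so the path is not exactly stationary (the text notes that stationarity needs a different initial distribution), and the cap `max_gap` as well as all function names are our own additions.

```python
import math
import random
from typing import List

NORMALIZER = 2.0 * math.log(2.0) - 1.0  # equals sum over k >= 1 of 1 / (k (4k^2 - 1))


def pmf(k: int) -> float:
    """P(tau = k) for the inter-arrival time between two consecutive 1's."""
    return 1.0 / (NORMALIZER * k * (4 * k * k - 1))


def draw_gap(rng: random.Random, max_gap: int = 10**6) -> int:
    """Sequential inversion of the CDF; `max_gap` guards against rounding issues."""
    u = rng.random()
    cumulative = 0.0
    for k in range(1, max_gap):
        cumulative += pmf(k)
        if u <= cumulative:
            return k
    return max_gap


def simulate(n: int, seed: int = 0) -> List[int]:
    """Binary path of length n: each gap contributes (gap - 1) zeros and then a 1."""
    rng = random.Random(seed)
    path: List[int] = []
    while len(path) < n:
        gap = draw_gap(rng)
        path.extend([0] * (gap - 1) + [1])
    return path[:n]
```

Because the tail of the gap distribution decays only polynomially, long runs of zeros appear occasionally, which is what produces the deep, unbalanced context tree.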

8. Linguistic rhythm differences between European and Brazil- 
ian Portuguese. In this section we revisit the application and the data 
considered in [16] regarding the linguistic features underlying the Euro- 
pean Portuguese and Brazilian Portuguese languages. For each language, 
the data consist of articles from a popular daily newspaper from the years 
1994 and 1995. For each year and each newspaper, 20 articles were randomly 
selected. The linguistic features are represented by a quinary alphabet with 
four rhythmic features and an additional feature representing the end of 
an article. The four rhythmic features represent: non-stressed, non prosodic 
word initial syllable (0); stressed, non prosodic word initial syllable (1); non- 
stressed, prosodic word initial syllable (2); and stressed prosodic word initial 
syllable (3). 

In [16], for each newspaper, the 40-day sample is concatenated into a
single string containing respectively a sequence of 105326 and 97750 linguistic
features. In order to concatenate articles from different days, a homogeneity
assumption was required. However, heterogeneity over different days, or at
least over the different years, is a source of potential concern. For example,
1994 was a World Cup year and the media in both countries are heavily
influenced by such an event.

We propose to account for the possible heterogeneity in the conditional
probabilities and consider each year as a group in the group context tree
model. Thus we allow for year-specific conditional probabilities. Figure 2
displays the estimated context trees. As in [16], we find that European
Portuguese has a more complex context tree, possibly reflecting the changes
the language suffered during the 18th century. Both context trees are similar
to the trees found in [16], confirming their finding even in the presence of
heterogeneity.
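The grouping idea can be illustrated with a few lines of Python: every year shares the same set of contexts, but the next-symbol frequencies are tabulated separately per year. This is a toy sketch with made-up inputs, not the AGCT pruning algorithm and not the Portuguese data; the function name `group_probabilities` and the dictionary layout are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def group_probabilities(
    groups: Dict[str, str], leaves: List[str]
) -> Dict[str, Dict[str, Dict[str, float]]]:
    """For each group, relative next-symbol frequencies after each shared context."""
    depth = max(len(w) for w in leaves)
    result: Dict[str, Dict[str, Dict[str, float]]] = {}
    for name, sample in groups.items():
        counts: Dict[str, Counter] = defaultdict(Counter)
        for t in range(1, len(sample)):
            past = sample[max(0, t - depth):t]
            matches = [w for w in leaves if past.endswith(w)]
            if len(matches) == 1:  # skip positions whose past is too short
                counts[matches[0]][sample[t]] += 1
        result[name] = {
            w: {a: c / sum(nxt.values()) for a, c in nxt.items()}
            for w, nxt in counts.items()
        }
    return result
```

Keeping one table per year is what allows year-specific conditional probabilities while the tree itself is common to all years.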

9. Conclusion. Understanding the memory structure of stochastic
processes has proved to be of fundamental importance in applications. VLMC
models have been playing a central role in the modeling and estimation of
stationary processes with discrete alphabets. In this work we consider an
extension of the traditional VLMC in which many stationary processes share the
same context tree but potentially different conditional probabilities. Since
we allow for potentially infinite memory processes, we propose to focus the
estimation on an oracle context tree that optimally balances the bias and
variance trade-off for a given sample.

Table 3
The table illustrates the model selection performance for selecting nodes of the true
context tree in the binary Markov chain induced by renewal process $X_t$.

Fig 2. Estimated context trees for the Brazilian Portuguese and European Portuguese
languages, accounting for heterogeneity in different years.
We propose a computationally efficient estimator for the underlying
context tree and the associated conditional probabilities. We establish
several properties of the proposed estimator, including oracle inequalities
for model selection and estimation of conditional probabilities. Two
methodological applications, discrete dynamic stochastic programming and
discrete choice models, motivated the proposal of the group context tree
model. In these applications we are interested in functionals of the
conditional probabilities. We develop uniform bounds for the estimation of
these functionals, accounting for possible misspecification of the
estimated context tree. We also propose and analyze data-driven choices of
the penalty for the regularization, and study their typical behavior under
$\beta$-mixing conditions.

Finally, we apply the group context tree model and the proposed estimators
to investigate the rhythmic differences between Brazilian and European
Portuguese, accounting for possible heterogeneity in the sample. Our
results fully support previous findings in the literature.

APPENDIX A: TECHNICAL PROPERTIES 

We start by establishing technical results that follow from the definition
of the pruning algorithm. We begin with an alternative characterization of
$\hat T_n$.

Proposition 1. The estimated tree $\hat T_n$ equals the smallest tree
contained in $E_n$ which contains all $w \in E_n$ with CanRmv($w$) = 0.

Proof of Proposition 1. Let $S_n$ denote the aforementioned "smallest
tree". Clearly, $S_n \subset \hat T_n$, as any $w$ with CanRmv($w$) = 0
will not be removed from $\hat T_n$ in PruneTree. To prove that
$E_n \setminus S_n \subset E_n \setminus \hat T_n$, we will use the
following claim:

Claim 1. If $v \in E_n \setminus S_n$, then
$\forall w \in E_n : w \succeq v \Rightarrow$ CanRmv($w$) = 1.

Proof of Claim. In contrapositive form, if some $w \succeq v$ satisfies
CanRmv($w$) = 0, that $w$ belongs to $S_n$ by definition, and then
$v \in S_n$ because $S_n$ is a tree and $v \preceq w$. □

One may use induction starting from the leaves to deduce that, if
$v \in E_n$ is such that the conclusion of the Claim holds, it will be
removed from $\hat T_n$ at some stage of PruneTree. We deduce
$E_n \setminus S_n \subset E_n \setminus \hat T_n$. □
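
The characterization in Proposition 1 can be illustrated with a short sketch. It is not the paper's PruneTree routine: it only computes the "smallest tree" containing every node whose removal flag is 0. The representation of nodes as strings, the convention that the parent drops the leftmost (oldest) symbol, and the names `parent`, `smallest_tree` and `can_rmv` are all assumptions made for illustration.

```python
def parent(w):
    """Parent of a context (assumed convention: drop the leftmost symbol)."""
    return w[1:]


def smallest_tree(can_rmv):
    """Smallest tree containing every node w with can_rmv[w] == 0.

    can_rmv maps each node of the candidate tree (a string) to 0 or 1.
    A tree is taken to be a set of strings closed under `parent`;
    the root is the empty string and is always kept.
    """
    tree = {""}
    for w, flag in can_rmv.items():
        if flag == 0:
            v = w
            # Add w together with all of its ancestors up to the root.
            while v not in tree:
                tree.add(v)
                v = parent(v)
    return tree


# Hypothetical removal flags on a small candidate tree over {0, 1}.
can_rmv = {"": 0, "0": 1, "1": 1, "00": 1, "10": 0, "01": 1, "11": 1}
estimated = smallest_tree(can_rmv)
```

In this toy input the node "0" has flag 1 but is still retained, because its descendant "10" cannot be removed; this mirrors the step "$v \in S_n$ because $S_n$ is a tree" in the proof of the Claim.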

It turns out that, except for the root, any node of the estimated tree
$\hat T_n$ must be closely connected with two nodes that yield
substantially different probability distributions.

Proposition 2. Suppose $v \in \hat T_n \setminus \{e\}$. Then there exist
$w', w'' \in E_n$ with $w' \succeq \mathrm{par}(v)$, $w'' \succeq v$ and

$$c\,\big[\|\mathrm{cf}(w')\|_{L,r} + \|\mathrm{cf}(w'')\|_{L,r}\big] < \|d(\hat p_n(\cdot|w'), \hat p_n(\cdot|w''))\|_{L,k}.$$

Proof of Proposition 2. Assume (to get a contradiction) that no such
$w', w''$ exist. In that case one can easily check that CanRmv($w$) = 1 for
all $w \succeq v$. In particular, the subtree of $E_n$ obtained by removing
$v$ and all of its descendants contains all $u$ with CanRmv($u$) = 0.
Proposition 1 then implies $v \notin \hat T_n$, which contradicts the
assumptions of the present Proposition and finishes the proof. □

Finally, the following result formally states the compatibility between the
tree structure $\hat T_n$ and the probability distributions
$\hat p_n(\cdot|x)$, which follows immediately from the pruning definition.

Proposition 3. Let $x, y \in A_{-\infty}^{-1}$ satisfy
$\hat T_n(x) = \hat T_n(y)$. Then $\hat p_n(\cdot|x) = \hat p_n(\cdot|y)$.

APPENDIX B: PROOFS OF SECTION 3 

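Proof of Theorem 1. We prove the case that $k$, $m$, and $r$ are finite
(the other case follows similarly). If
$\min_{1\le\ell\le L} N_{n-1,\ell}(w) = 0$, by definition we have
$\hat p_{n,\ell}(a|w) = \bar p_{n,\ell}(a|w) = 1/|A|$ and the result
follows. Otherwise, using the definitions of $\|\cdot\|_{L,k}$ and
$\|\cdot\|_{L,m}$ and Hölder's inequality (since $m > k$) we have

$$\|d(\hat p_n(\cdot|w),\bar p_n(\cdot|w))\|_{L,k}
 = \Big(\frac{1}{L}\sum_{\ell=1}^{L} d_\ell^k(\hat p_{n,\ell}(\cdot|w),\bar p_{n,\ell}(\cdot|w))\Big)^{1/k}$$
$$ = \Big(\frac{1}{L}\sum_{\ell=1}^{L} \frac{d_\ell^k(\hat p_{n,\ell}(\cdot|w),\bar p_{n,\ell}(\cdot|w))}{\mathrm{cf}_\ell^k(w)}\,\mathrm{cf}_\ell^k(w)\Big)^{1/k}$$
$$ \le \Big\|\Big\{\frac{d_\ell(\hat p_{n,\ell}(\cdot|w),\bar p_{n,\ell}(\cdot|w))}{\mathrm{cf}_\ell(w)}\Big\}_{\ell=1}^{L}\Big\|_{L,m}\,\|\mathrm{cf}(w)\|_{L,\frac{km}{m-k}}.$$

Thus, if Good$_m$ occurs, for any $r \ge km/(m-k)$ we have

$$\|d(\hat p_n(\cdot|w),\bar p_n(\cdot|w))\|_{L,k} \le \|\mathrm{cf}(w)\|_{L,r},$$

since $\|v\|_{L,r'} \le \|v\|_{L,r}$ if $r' \le r$. □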
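
The Hölder step with exponent $km/(m-k)$ can be checked numerically. The sketch below assumes the normalized norm $\|v\|_{L,p} = (L^{-1}\sum_\ell |v_\ell|^p)^{1/p}$ (an assumption consistent with the monotonicity in $p$ used at the end of the proof); the distances and confidence radii are made-up numbers, and the function names are hypothetical.

```python
def norm(v, p):
    """Normalized norm ((1/L) * sum |v_l|^p)^(1/p), assumed definition of ||.||_{L,p}."""
    L = len(v)
    return (sum(abs(x) ** p for x in v) / L) ** (1.0 / p)


def holder_sides(d, cf, k, m):
    """Return (lhs, rhs) of ||d||_{L,k} <= ||d/cf||_{L,m} * ||cf||_{L,km/(m-k)}."""
    assert m > k > 0
    r = k * m / (m - k)
    ratio = [di / ci for di, ci in zip(d, cf)]
    return norm(d, k), norm(ratio, m) * norm(cf, r)


# Illustrative (made-up) per-group distances and confidence radii.
d = [0.10, 0.30, 0.05, 0.20]
cf = [0.25, 0.35, 0.10, 0.40]
lhs, rhs = holder_sides(d, cf, k=1.0, m=3.0)
```

With these inputs every ratio $d_\ell/\mathrm{cf}_\ell$ is below 1, so the right-hand side is in turn at most $\|\mathrm{cf}\|_{L,r}$ for $r = km/(m-k) = 1.5$.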

Proof of Theorem 2. We need to show that no node outside of $T^*$ belongs
to $\hat T_n$. A string $z \in A^*$ lies outside of $T^*$ if there exists a
leaf $w$ of $T^*$ such that $w \preceq z$, $w \ne z$. Notice that, if such
a $w$ exists, we have $\bar p_n(\cdot|w') = \bar p_n(\cdot|w'')$ for all
$w', w'' \succeq w$. Thus, if Good$_m$ holds, by Theorem 1 we may deduce
that:

$$\|d(\hat p_n(\cdot|w'),\hat p_n(\cdot|w''))\|_{L,k} \le \|d(\hat p_n(\cdot|w'),\bar p_n(\cdot|w'))\|_{L,k} + \|d(\hat p_n(\cdot|w''),\bar p_n(\cdot|w''))\|_{L,k}$$
$$\le \|\mathrm{cf}(w')\|_{L,r} + \|\mathrm{cf}(w'')\|_{L,r}.$$

This implies that the entire subtree of $E_n$ rooted at $w$ is pruned in
the criterion (2.4) since $c > 1$, and therefore $z \notin \hat T_n$. □
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Proof of Theorem 3. Let x G A_]^ and set w = T{x) and w = Tn{x). 
Since ||cf(t(;)||L^ is monotone in w, if \w\ < \w\ we have ||cf('fB)||L^ < 

llcfHIlL,,. 

Next assume that \w\ > \w\ and let w = par{w). By Proposition 2 there 
exist w' , w" G En with w' ^ w ^ w, w" ^ w ^ w and 

|cf(^')||L,. + l|cfK)||L,r ] < ||d(p„0'),PnO"))||L,fc- 

On the other hand, we claim that 

Claim 2. In Good^, for all w' ,w" G A* with w',w" >z w, 

\\d{pn{-\w'),pn{-\w"))\\^^^ < 2cy, + ||cf(u;')||L^^ + ||cf(u;")||L^^. 
The Claim implies: 



\cf{w')\l +||cf(it;")||T 

I ^ ' \lL,r M ^ 'llL,r 

which (since c > 1) is the same as: 



<2c^ + ||cf(u;')|L +||cfK) 



ZCq, 



lL,r' 



|cfK)L,. + ||cfK)IL,,<^_i 



It follows that 



l|cf(^)llL..< 



ZCi, 



\cf{w) 



lL,r' 



since cf^(-) is increasing, w < w' and w < w" . Thus we only need to prove 
the Claim in order to finish the proof of the present Theorem. 

To prove the Claim, fix some w' G -E„ with w' ^ w and write s = \w'\. 
We observe that: 



||d(p„(-|ii.),p„(-|»'))|lLfc < 



(B.l) 



E,"=s + lXj^,-l(^j^^,j<i(!(p„,f(-|™),P«<-|^i<i<f))) 



"„-l.f(™') 



- '"P{.(£) : .ll(f) = ™}L_ ||{<^f(P..,K-|™),Pf(-kW))}Ll 



t = l 



(m = T(x) and Rem. 1) < s"P{j(j!) : T(i(f)) = ra}L || { ''^ (Pn.f (' I™). Pf (' |z(*)))}^ = i || ^^ ^ 

Recall from Theorem 1 that if Goodm holds, \\d{pni-\w'),pn{-\w'))\\^ j^ < 
|cf(w')||L^ for all w' G A*, hence by the triangle inequality: 



In Goodm, yw' G A* : 



lL,r- 



Comparing w',w" >z w via the triangle inequality and (B.2) finishes the 
proof of the Claim. D 
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Proof of Theorem 4. First apply Theorem 3 with |T„(j;)| > |r(x)| so 
that 

cf(7;(x))ll - ^''^("^ 

Next, by condition RL, we have 



< 

L,r C — 1 



|Tn(a:)|-|T(:r)| 

cf(T„(x)) > ||cf(r(x))||L,, ; ^ 



since d^{Tn{x)) < 1 so that cf(r„(x)) 
bining both inequahties. 



< 1. The result follows by com- 



D 



Proof of Theorem 5. By the triangle inequality and the definition of 

CT{,x) 



d{Pn{-\Tn{x)),p{-\x)) < d{pni-\Tnix)),Pni-\T{x))) 



L,fc 



L,fc 



+ Ct(x) 



SO that the second relation follows from the first. 

To prove the first relation, note that under Goodm, by Theorem 1 we 
have 



(B.3) 



\\d{p^{-\nx)),pMT{xm\uk < \WHT{x))\u, 



We divide the rest of the analysis into three cases. 

Case 0: |T„(x)| = |r(3;)|. The result follows from the relation above since 

r„(x) = T{x). 

Case 1: |T„(a;)| < |T(x)|. In this case by the triangle inequality and (B.3) 
we have 

||d{p„(-|T(a.)),p„{.|f„(a.)))|| < \\d{p„(-\T„{a:)),p„(-\T(xm\\ + \\d{p„(-\T(x)), p„(-\T(x)))\\^ ^ 

II ML, A; II IIL,fc ' 

< \\d{p„(-\T„{a:)),p„(-\T(xm\\ + \\cf(T{a:))\\^ ^. 

II IIL.fc ' 



Next note that T{x) was pruned since |T„(3;)| < |T(a;)|. Therefore, 



diPni-\Tnix)),Pni-\T{x))) 

Thus we have 

d(p„(.|T(x)),p„(-|T„(x))) 



L,k 



< C 



< c 



L,k 



cf(T„(x)) +||cf(r(x))||L,, 



cf(T„(a;)) +(l + c)||cf(T(a:))||L., 

L.r 



< (l + 2c)||cf(T(.T))||,, 
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Since 



cHTn{x)) < ||cf(r(2;))||L^, because |T„(x)| < \T{x) 

L,r ' 



Case 2: |T(j;)| < |T„(x)|. In this case, notice that by (B.l) we have 



d{pni-\T{x)),Pni-\Tnix))) 



< 



L.k 



Ct(x)- 



Hence, under Goodm, using the triangle inequality and Theorems 1 and 3, 
we have 

\\d(Pn.(■\T,^(x)),p„(■\T{x)))\\ < ||d(p„(-|T„(a:)),p„{.|f„(a:)))|| + 11 d(p„ (■ 1T„ (a:)), p„ (■ |T(a.))) 11 

II II La ,k II II Li, fc II II ij,K 

< ||rf('rr.(^))||^ ^ + ctc^) 

2ct(x) c + 1 
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APPENDIX C: PROOFS OF SECTION 4 



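Consider a collection of subsets $\mathcal S \subseteq 2^A$. We consider
the pseudo-metric on the simplex $\Delta$ given by:

$$d_\ell(p,q) = d_{\mathcal S}(p,q) = \sup_{A \in \mathcal S} |p(A) - q(A)|.$$

For each process $\ell = 1,\dots,L$ and $S \in \mathcal S$, we have
empirical transition probabilities:

$$\hat p_{n,\ell}(S|w) = \frac{N_{n,\ell}(wS)}{N_{n-1,\ell}(w)}, \quad \text{where } N_{n,\ell}(wS) = \sum_{a \in S} N_{n,\ell}(wa),$$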

and the "oracle probabilities": 



PnAS\w) 



Nn-lA'^) 



if $\min_{1\le\ell\le L} N_{n-1,\ell}(w) > 0$. Both probabilities are
defined as $|S|/|A|$ when $\min_{1\le\ell\le L} N_{n-1,\ell}(w) = 0$. We
will study these quantities using the martingale framework from Appendix E.
A simple calculation (omitted) proves the following proposition.
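
As a concrete illustration of the empirical transition probabilities and of the $|S|/|A|$ fallback, here is a minimal sketch. It assumes each process is given as a Python string, that $N_{n}(w)$ counts (possibly overlapping) occurrences of $w$ in the first $n$ symbols, and that $wa$ denotes the context $w$ followed by the symbol $a$; the function names are hypothetical.

```python
def count(x, w):
    """Number of (possibly overlapping) occurrences of the string w in x."""
    return sum(1 for i in range(len(x) - len(w) + 1) if x[i:i + len(w)] == w)


def p_hat_group(xs, w, S, alphabet):
    """Empirical probability of the set S after context w, for each process in xs.

    If the context w is unseen (in the first n-1 symbols) for at least one
    process, every process falls back to |S| / |A|, as in the text.
    """
    denoms = [count(x[:-1], w) for x in xs]          # N_{n-1,l}(w)
    if min(denoms) == 0:
        return [len(S) / len(alphabet)] * len(xs)
    numers = [sum(count(x, w + a) for a in S) for x in xs]  # N_{n,l}(wS)
    return [num / den for num, den in zip(numers, denoms)]


single = p_hat_group(["0110101"], "1", {"0"}, "01")
fallback = p_hat_group(["0110101", "0000"], "1", {"0"}, "01")
```

In the first call the context "1" occurs three times among the first six symbols and is followed by "0" twice, giving 2/3. In the second call the context is unseen in one of the two processes, so both entries default to 1/2.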

Proposition 4. Given w e A*, S £ S and 1 < ^ < L, write XqA'^S) = 
and for 1 < t < n: 



XtAwS) = NtAwS) -Y.X{xu Ai)=n.}Pi{S\Xl^{i)). 



i=l 



i— |tjj| ^ 
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Then this is a martingale with filtration a{X^^{i))t>o o.^d quadratic varia- 
tion process 

Vt,i ^ Vt^,iw) = zU ^[l^j-.^ - ^j-i/l V(^ioo W)] 

< Nt-l/{w)pn/{S\w){l -pn,£{S\w)). 

Proposition 4 will be useful in our calculations because Xt/ defined above 
satisfies: 

(C.l) de{pnA-m^Pn,d-m) = sup — -^ — -— . 

SeS Nn-1/[W) 

Note that the proofs of Theorems 6 and 7 follow directly from Theorem 8.
The proof relies on martingale inequalities developed in Appendix E.
For $\gamma > 1$ define
(C.2) 



, , 2a2(u)) / r4|5|(logn)log(n2L/^)1 „, r„ , r 9. n ,. / m1 



if 



^l(»)A'„-.(«.) > ■," > ^Ml'l^l/*) + 2'°8Ki + '»)(2 + '»)! 



log^(2-(l/7)) 



where $i_0$ is the smallest such integer, and

(C.3) 



. ^ I 2^y L f 4151 (log n)log(n2LM)1 ~ T ' ~~ "TT 

'^^H = i/t7 — Vry^Qgi 11 — U2iog{2 + iog^iv„_i,,H }, 

y A^„-i,Hw^) V I loglogn J L 7L 

otherwise. The results stated in Section 4 correspond to choosing
$\gamma = 2$. It follows that $\mathrm{cf}_\ell(w) \ge c_\ell(w)$ with
$\gamma = 2$.

Proof of Theorem 8, Goodoo- For notational convenience let (i£(t(;) = 
deiPnA'\'^)^PnA'\'^))^ a^d the event E^ = {mmi<i<i^ Nn-i^w) > 0}. 

FfsweA*: max ^ > l) < V P f max 4t4 > ll^^' V (^-) • 

Using (C.l), the union bound and Lemma 5 applied (twice) to each 5 € 5, 
and for every £ = 1, . . . , L, we have that 



max ^,>l\E^<Y.E^ f IxT^T^TtW > 11^-1 ^ ^■ 

i=i,...,Ldiiw) ' y ^^ v^n-i/(w^)cf£(u;) y n2 
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The result follows by noting that Y2weA*^ i^w) — ^ l'^ where the last 
expression is the maximum number of different substrings of a string of 
length n (note that E^ requires the substring to appear in all L strings). D 

Before proceeding to the second result of the section we need a technical 
lemma. 

Lemma 2. Assume that ci{w) is a random variable such that for some 
M > 1 we have that for every i = 1, . . . , L 



j2/. 



P(df(!5„,<.(-|™),P„,<.(-|"')) > v^A?cfj.(iD)|]V„_i^(iD) > 0) < S'^ . 

Then 



L^ cf,H ; 



Fl3weA*: min N, 

\ 1<£<L 

Proof. For notational convenience let de{w) = di{pn,ii-\w),pn/{-\w)) 
and the events Eg^w = {Nn-i/{w) > 0} and E^ = {mini<^<L iV„_i^^(tt;) > 
0}. First note that 



Then we have that for each w £ A* and M > 

/ 1 Jt, dj(w) \ / 1 -A dj(w) , 9 A 

«■ 7E il74>' + ''l^™ ^"P 7E1IH >i + ''|E-.d?M<A«'M + 



+L 



max P('d,(tu) > \/Mcff(w)\Ef „.) . 



Note that condition on the events Ec^^ and {d'j{w) < Mc^'j{w)}, we have 
that the variable Zg := {dj{w) / dj{w)) - 1 is such that E[Zi] < 0, \Z(\ < M, 
and E[Zf] < M. Then by Bernstein's inequality we have 



-^Zi>h\ E^,dj{w) < Mdj{w) j < exp 



1 h^L 



2M(l + /i/3) 

and 

L max P (dAw) > y/Mde(w)\EiJ] < L5^ . 
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Therefore, 



' L ^«-i rff(uo y ^v V 2A/(i + fe/3)y / 4-:. 



1 / /I ft L 

< cxp + L(5'' 

1\ \ 2 M{l + h/Z) 



U 



Proof of Theorem 8, Good2. 'Leidi{w) = di{pni{-\w),Pn/{-\w)), Ei^^ 
{Nn^i^eiw) > 0}, and Ey, = { min N^^i^eiw) > 0}. 

Let M := log(ra^L/^)/loglo gn > 1, /i = 1/logn, (5^ = /x/[2(l + /x)M|5|], 

/^' := (3/2)^ '°g("lYoS!.;^ P^ and cK^/;) as defined (C.2) and (C.3). 
By Lemma 5 applied to each S € S, we have that 

¥{de{w) > ce{w)\Ee,^) < 2\S\6c = ^/[(l + ^)M]. 

By definition of /i, Ci{w), and the relation above, we have 



E 



djM 



djM 



+M¥{de{w) > ce{w)\Ee^yj) 

— 1+M (l+/i)M 

By Lemma 5 it also follows that for any M > 1 we have P{de{w) > 
My(l + /i)c£(w)|-E^^t„) < 6^ and both conditions of Lemma 2 holds for 
the confidence band being ^/(l + ^)ci{w). 
Therefore 

1 J^ dj{w) ,\ n? ( ( 1 m'^L 

3»e A* :B„.-5^ ^^_^', ,, . >1 + A' 1 < — l<=xp(--- 



+ 



H)c2{«,) / 2 V V 2M(1 + m73) 



Note that n^LJ^^ < 5 provided that M > log(n2L/(5)/log(l/5c)- Since 
2(1 + ^)M\S\ > I, 5c < fJ- = 1/logn, our choice of M satisfies this relation. 



2„_ _l^^^^l^_\ ^ X ^,~r..AA^A .l.„f A^' ^ . /MlogjnyS) 



Next note that n exp ( — 2 mTiTTTTsT ) — ^' Pi'O'^ided that iJti/^ > y °^£^' 
which is satisfied by 



,_ 3 /log(n2L/(5)log(nV(5) 



2 y log log n L 

since we assumed log^ {n'^L / 6) < SLloglogn. 



D 
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APPENDIX D: PROOFS OF SECTION 5 

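Proof of Lemma 1. For $x \in A_{-\infty}^{-1}$ and $u \in U$, denote by
$V = V(x)$ the true value function, by $P_{u,x}$ the true transition
probability (infinite) matrix, and by $\hat V = \hat V(\hat T_n(x))$ the
value function associated with the estimated transition probabilities
$\hat P_{u,x} = \hat P_{u,\hat T_n(x)}$.

Each value function is the fixed point of a contraction mapping, $H$ and
$\hat H$ respectively, on the functions $W : A_{-\infty}^{-1} \to
\mathbb{R}$. Formally, the mappings

$$H(W)(x) = \max_{u \in U}\{f(x_{-1},u) + \beta P_{u,x} W\} \quad\text{and}\quad \hat H(W)(x) = \max_{u \in U}\{f(x_{-1},u) + \beta \hat P_{u,x} W\}$$

are contractions with modulus $\beta$ by Blackwell's sufficient conditions.
Therefore we have

$$\|V - \hat V\|_\infty \le \|V - \hat H(V)\|_\infty + \|\hat H(V) - \hat V\|_\infty$$
$$= \|H(V) - \hat H(V)\|_\infty + \|\hat H(V) - \hat H(\hat V)\|_\infty$$
$$\le \|H(V) - \hat H(V)\|_\infty + \beta \|V - \hat V\|_\infty.$$

Thus, $\|V - \hat V\|_\infty \le \|H(V) - \hat H(V)\|_\infty/(1 - \beta)$,
where $\|H(V) - \hat H(V)\|_\infty = \max_{x \in A_{-\infty}^{-1}}
|H(V)(x) - \hat H(V)(x)|$. Thus the result follows by showing that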



\H{V){x) - H{V){x)\ 



niax{/(a;_i. It) + pPu.xV] - max{/(a;_i, w) + (iPu^xV} 
<l3rRax\{Pu,x-Pu,x)V\ 



/3niax 
lieu 



'^[PnM{a\Tn{x)) ~ Pu{a\x)]V {ax) 

aeA 

</3||F(-a;)||^max||P„,„(-|x)-p„(-|a;)||,. 



u 



Proof of Theorem 9. The result follows by applying Lemma 1 with
$q = 1$, which corresponds to $d_\ell = \|\cdot\|_1/2$, and Theorem 5 with
$r = k = m = \infty$

max < CT(^)+max| (1 + 2c)||cf(r(x))||L^^, -^—jCTix) > 

The bound on ||cf(T(2;))||L ^ follows from Theorem 6 with the family of 
sets 5 = 2^. D 
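
The contraction argument behind Lemma 1 can be illustrated on a toy problem with finitely many states. The sketch below is not the paper's setting (which has an infinite history space); it assumes a finite state space, made-up rewards and transition matrices, and hypothetical names. It checks the consequence $\|V - \hat V\|_\infty \le \beta \|V\|_\infty \max_{u,x}\|P_{u,x} - \hat P_{u,x}\|_1/(1-\beta)$.

```python
def value_iteration(f, P, beta, tol=1e-12):
    """Fixed point of H(W)(x) = max_u { f[x][u] + beta * sum_y P[u][x][y] * W[y] }."""
    n = len(f)
    V = [0.0] * n
    while True:
        new_V = [
            max(f[x][u] + beta * sum(P[u][x][y] * V[y] for y in range(n))
                for u in range(len(P)))
            for x in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new_V, V)) < tol:
            return new_V
        V = new_V


beta = 0.9
f = [[1.0, 0.5], [0.0, 2.0]]                       # f[x][u], made-up rewards
P = [[[0.9, 0.1], [0.2, 0.8]],                     # P[u][x][y], "true" transitions
     [[0.5, 0.5], [0.6, 0.4]]]
P_hat = [[[0.85, 0.15], [0.25, 0.75]],             # perturbed ("estimated") transitions
         [[0.55, 0.45], [0.6, 0.4]]]

V = value_iteration(f, P, beta)
V_hat = value_iteration(f, P_hat, beta)

gap = max(abs(a - b) for a, b in zip(V, V_hat))
l1 = max(sum(abs(p - q) for p, q in zip(P[u][x], P_hat[u][x]))
         for u in range(2) for x in range(2))
bound = beta * max(abs(v) for v in V) * l1 / (1 - beta)
```

The bound is typically loose, since it replaces the action-by-action comparison with a worst case over actions and states.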

Proof of Theorem 10. By the choice of cf we have that with proba- 
bility at least 1 — 6 the event Goodm occurs. Fix a £ A, x,y £ AZ^o let 



mi{a, x, y) = Pn/{a\T{x)) - Pn,i{a\T {y)) , AVEm(a, x, y) = E [mi{a, x, y)] 
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and note that 
(D.l) 



I ^ 

AVEm(a, x, y) - AVEm(a, x,y) = - ^[m£(a, x, y) - mi>{a, x, y)] + 

£=1 
I L 

+Y ^ ■mi{a, X, y) - AVEm(a, x, y)+ 



+AVEni(a, x, y) — AVEm(a, x, y). 



First note that for do 



we have 



|AVEm(a,a:,a)- AVEm(a,a:,a)| < || d(p„ (• |T(x)) , p(. [a:)) ||l,i + || d(p„ (• |r(!;)) , p(- |h)) ||l,i 

< cr(x) +CT(y)- 



Next, define -ElIo, x,y) = j^ ^^=i ^^(a, x, y) — AVEm(a, x, y) and note 

that since |m^(a, it;,tf')| < 1 we have -El (o, a;, y) < [2/L]+[(L—l)/L]Ei^-i{a,x,y). 

Moreover, note that Ei^{a,x,y) = Ei^{a,T{x),T{y)). Thus, we need to 
consider only suffixes w £ T (m particular, leaves of T) which implies that 
Nn,e{w) > for every i = l,...,L. Thus, for ^ = (e - [2/L])L/(L - 1) we 
have 



P I max £^L ('^!^!^)^^1 ^-f( max -E-L— i(ci, tn, iv ) > ^ 

KaeA,w,w'eT I \aeA,w,w' eT 



aGA, ^^ 



max -E'L— i(a,^f',^f ) ^ ^ 

linf N„ ,(™)>0, 



dnj JV„ ,(™')>0 



max £Jt _i (a, tn, w ) > E 

™:«„,I.(™)>0. 
\" •™':Af„_L<™')>0 



< 5Z -P (iVn,L(m) > 0,iV„ l{"'') > 0, BL-l(a,™,m') > e) ■ 



The event {Nn^hiw) > Oj^n^hiw') > 0} is independent of {Ei^^i{a,w,w') > 
^}. Moreover, since \fh£{a,w,'w')\ < 1 are i.i.d. draws from the population 
of agents, for any ^ > 

P{Ei^.i{a,w,w')>^) <exp(-(L- 1)^2/2) 
by Hoeffding's inequality. Therefore, 

p( max E,(a,w,w')>i\ < I Al oxp(- (L - 1)C^ /2) ■ ^^ P ( N„ ,(w) > 0, N„ ,(w' ) > o) . 

Next note that X] ™eA* P (Nn^hiw) > 0, Nn^h{w') > 0) is the expected num- 

w'eA* 

ber of different pairs w,w' G A* appearing in X^{i). Therefore we have 

Y, P {NnA-^) > 0,Nn,Uw') > 0) < nV4. 

weA* 
-w'eA* 
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Finally, to control the first term of (D-1), since dg 



we have 



r T.e=i [rnda, x, y) - mi{a, x, y)] 



zT.i=l[PnA(^\Tn{x)) -pnAO'\T{x))] 

T:T.]!=i[PnAo'\Tn{y)) - Pn.eHT{y))] 



dipJ-\Tjx)),pJ-\T{x))) 
d{pA-\TAv)),Pn{-\T{y))) 



L,l 
L,l' 



Under the event Good2, by Theorem 5, uniformly over z G A_]^ we have 
that 



d{pn{-\fn{z)),Pn{-\T{z)))\^ ^ < max {(1 + 2c)||cf(T(z))||L,2, ^Cr(.)} 
The result follows by combining these bounds. 

APPENDIX E: A COMPENDIUM OF MARTINGALE RESULTS 



D 



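Lemma 3. Let $(M_i, \mathcal F_i)_{i=0}^{n}$ be a martingale with
$M_0 = 0$ and $|M_i - M_{i-1}| \le Y_{i-1}$ for some
$\mathcal F_{i-1}$-measurable r.v. $Y_{i-1}$. Define
$V_n = \sum_{j=0}^{n-1} Y_j^2$. Then:

$$\forall \lambda, v > 0 : \quad P(M_n \ge \lambda,\ 0 < V_n \le v) \le P(V_n > 0)\, e^{-\frac{\lambda^2}{2v}}.$$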

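Proof. Step 1: Main arguments. Write $E = \{\sum_{j=0}^{n-1} Y_j^2 > 0\}$
and define

$$U_r = \exp\Big(s M_r - \frac{s^2}{2}\sum_{j=0}^{r-1} Y_j^2\Big) \quad (0 \le r \le n),$$

where $s > 0$ will be fixed later. By Step 2 below $(U_r, \mathcal F_r)$ is
a supermartingale. Now notice that:

$$M_n \ge \lambda,\ \sum_{j=0}^{n-1} Y_j^2 \le v \ \Rightarrow\ s M_n - \frac{s^2}{2}\sum_{j=0}^{n-1} Y_j^2 \ge s\lambda - \frac{s^2 v}{2} \ \Rightarrow\ U_n\, e^{\frac{s^2 v}{2} - s\lambda} \ge 1.$$

Therefore,

$$P\Big(M_n \ge \lambda,\ 0 < \sum_{j=0}^{n-1} Y_j^2 \le v\Big) \le e^{\frac{s^2 v}{2} - s\lambda}\, E[U_n \chi_E].$$

The result follows by considering $s = \lambda/v$ and noting that
$E[U_n \chi_E] \le P(V_n > 0)$ by Step 3 below.

Step 2: $(U_r, \mathcal F_r)$ is a supermartingale. Since $Y_r$ is
$\mathcal F_r$-measurable,

$$E\Big[\frac{U_{r+1}}{U_r} \,\Big|\, \mathcal F_r\Big] = E\big[e^{s(M_{r+1} - M_r)} \,\big|\, \mathcal F_r\big]\, e^{-\frac{s^2 Y_r^2}{2}}.$$

Recall that $|M_{r+1} - M_r| \le Y_r$, hence by convexity (when $Y_r > 0$)

$$e^{s(M_{r+1} - M_r)} \le \cosh(sY_r) + \sinh(sY_r)\,\frac{M_{r+1} - M_r}{Y_r}.$$

Taking conditional expectations, we see that:

$$E\big[e^{s(M_{r+1} - M_r)} \,\big|\, \mathcal F_r\big] \le \cosh(sY_r) + \frac{\sinh(sY_r)}{Y_r}\, E[(M_{r+1} - M_r) \mid \mathcal F_r] = \cosh(sY_r).$$

This implies
$E[e^{s(M_{r+1} - M_r)} \mid \mathcal F_r]\, e^{-s^2 Y_r^2/2} \le
\cosh(sY_r)\, e^{-s^2 Y_r^2/2} \le 1$, since $\cosh(x) \le e^{x^2/2}$.

Step 3: $E[U_n \chi_E] \le P(V_n > 0)$. Write $E_0 = \{Y_0 \ne 0\}$ and
$E_j = \{Y_j \ne 0\} \cap \{Y_k = 0,\ 0 \le k < j\}$. Notice that
$E = \cup_{0 \le j \le n-1} E_j$ (where the union is disjoint) and that each
$E_j$ is $\mathcal F_j$-measurable. Moreover, if $E_k$ holds, we have
$\sum_{j=0}^{k-1} Y_j^2 = 0$ and $M_k = M_0 = 0$, hence $U_k = 1$.
Therefore,

$$E[U_n \chi_E] = \sum_{k=0}^{n-1} E[U_n \chi_{E_k}]$$
$$(E_k \text{ is } \mathcal F_k\text{-measurable}) \quad = \sum_{k=0}^{n-1} E\big[\chi_{E_k} E[U_n \mid \mathcal F_k]\big]$$
$$(U_k \text{ is a supermartingale}) \quad \le \sum_{k=0}^{n-1} E[\chi_{E_k} U_k]$$
$$(U_k = 1 \text{ in } E_k) \quad = \sum_{k=0}^{n-1} P(E_k) = P(E).$$

□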
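
As a sanity check of the type of bound obtained from the choice $s = \lambda/v$, consider the simplest special case: a simple $\pm 1$ random walk, for which one may take $Y_i = 1$, so that $V_n = n$, $P(V_n > 0) = 1$, and the bound reads $P(M_n \ge \lambda) \le e^{-\lambda^2/(2n)}$. The sketch below computes the exact tail from the binomial distribution; the function names are hypothetical and the random walk is an illustrative assumption, not an example from the paper.

```python
from math import comb, exp


def tail(n, lam):
    """Exact P(M_n >= lam) for a simple +-1 random walk M_n.

    M_n = 2*K - n with K ~ Binomial(n, 1/2).
    """
    favourable = sum(comb(n, k) for k in range(n + 1) if 2 * k - n >= lam)
    return favourable / 2 ** n


def exponential_bound(n, lam):
    """exp(-lam^2 / (2 v)) with v = n, since Y_i = 1 gives V_n = n."""
    return exp(-lam ** 2 / (2 * n))


n = 20
checks = [(lam, tail(n, lam), exponential_bound(n, lam)) for lam in range(1, n + 1)]
```

Each triple in `checks` holds a threshold, the exact tail probability, and the exponential bound, so the two can be compared directly.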



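Lemma 4. Let $(M_i, \mathcal F_i)_{i=0}^{n}$ be a martingale with
$M_0 = 0$, $|M_i - M_{i-1}| \le Y_{i-1} \le 1$ for some
$\mathcal F_{i-1}$-measurable r.v. $Y_{i-1}$. Define
$V_n = \sum_{j=1}^{n} E[(M_j - M_{j-1})^2 \mid \mathcal F_{j-1}]$. Then:

$$\forall \lambda, v > 0 : \quad P(M_n \ge \lambda,\ 0 < V_n \le v) \le P(V_n > 0)\, e^{-\frac{\lambda^2}{2v}\left(2 - \exp(\lambda/v)\right)}.$$

Proof. Step 1: Main Arguments. Write $E = \{V_n > 0\}$ and define

$$U_r = \exp\Big(s M_r - \frac{s^2 e^s}{2}\sum_{j=1}^{r} E[(M_j - M_{j-1})^2 \mid \mathcal F_{j-1}]\Big) \quad (0 \le r \le n),$$

where $s > 0$ will be fixed later. It follows that $(U_r, \mathcal F_r)$ is
a supermartingale by Step 2 below. Now notice that:

$$M_n \ge \lambda,\ V_n \le v \ \Rightarrow\ s M_n - s\lambda + \frac{s^2 e^s v}{2} - \frac{s^2 e^s}{2} V_n \ge 0 \ \Rightarrow\ U_n\, e^{\frac{s^2 e^s v}{2} - s\lambda} \ge 1.$$

Therefore,

$$P(M_n \ge \lambda,\ 0 < V_n \le v) = E\big[\chi_{\{M_n \ge \lambda,\, V_n \le v\}}\, \chi_E\big] \le E\big[U_n\, e^{\frac{s^2 e^s v}{2} - s\lambda}\, \chi_E\big] = e^{\frac{s^2 e^s v}{2} - s\lambda}\, E[U_n \chi_E].$$

The Lemma follows from choosing $s = \lambda/v$ and noting that
$E[U_n \chi_E] \le P(E)$ by Step 3.

Step 2: $(U_r, \mathcal F_r)$ is a supermartingale. Since $Y_r$ is
$\mathcal F_r$-measurable,

$$E\Big[\frac{U_{r+1}}{U_r} \,\Big|\, \mathcal F_r\Big] = E\big[e^{s(M_{r+1} - M_r)} \,\big|\, \mathcal F_r\big]\, e^{-\frac{s^2 e^s}{2} E[(M_{r+1} - M_r)^2 \mid \mathcal F_r]}.$$

Recall that $|M_{r+1} - M_r| \le Y_r \le 1$, hence by [20] page 32,

$$E\big[e^{s(M_{r+1} - M_r)} \,\big|\, \mathcal F_r\big] \le \exp\Big(\frac{s^2 e^s}{2}\, E[(M_{r+1} - M_r)^2 \mid \mathcal F_r]\Big).$$

This implies
$E[e^{s(M_{r+1} - M_r)} \mid \mathcal F_r]\,
e^{-\frac{s^2 e^s}{2} E[(M_{r+1} - M_r)^2 \mid \mathcal F_r]} \le 1$.

Step 3: $E[U_n \chi_E] \le P(E)$. The proof is similar to Step 3 in
Lemma 3. □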
For any 7 > 1, (^ G (0, 1), define monotonic function h : [0, 00) — t- [0,oo) 



h{x) = 2x7 log ^ t(1 + log^ x){2 + log^ x) > , 

and let zq be the smallest integer such that 

yMog2(2-(l/7))>21og(2/5)+21og[(l + io)(2 + io)], 
so that 2 - exp(Y//i(7*o)/7»o+i) > ^/^^ 

Lemma 5. Let {Mi,Ti)^Q be a martingale with Mq = 0, |Mj — Mj_i| < 
yj-i < 1 for some J^i^i -measurable binary r.v. li_i. Define Vn = Yll=i E[{Mj- 

Mj-if\J^j-i] andV„, = E]=oYr Then 

P (m„ > v//^(K), K > 7''") +IP (A^n > \lh{Vn)h. < T4 < 7''" ) < '5-IP (K > 0) . 
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Proof. First note that 



1 / i>>o V.= l 

< Z; ^ ( X^ ''i ^ \/'>(7'), < V„ < 7* + l 



< -P(V"„ > 0) ^ cxp 



fe(7') 

' 27^ + 1 



2 — cxp 



V^MtM 



< P(V-„ > 0) XI cxp T- log - - log[(l + j)(2 + i)\\ 



where the last line follows from the definition of iq. Since io > it follows 
that 



^ exp (- log ^ - iog[(i + 0(2 + .)])= ^ E TTT-Tr 

i>io ^ / i>iQ VTA 



)(2 + 



< 



Next, note that the event < V^ < 7*" ^ V„ > and that Vn only takes 
integer values, hence: 



+00 



{K > 0} = y ^i where Ei = {7* < K < 7'+'} 



1=0 



and the union is disjoint. We deduce 

■P ( XI d, > \/h{V„)/7, V„>0) =X;p(XId, > \/h(i>„)/7, 7* < !>„ < 7'+M 

< ^ P ( ji^ di > y'h(7i)/7, < V-„ < 7*+^ 



i>0 \i = l 



< P(V„ > 0) 51] <=xp 



fc(7') 
"27^ + 2 
2 



< P{V„ > 0) ^ cxp (- log - - log[(l + i){2 + i)]) 

i>io 

< P{V„ > 0)S/2. 



APPENDIX F: PROOFS OF SECTION 6 



D 



Proof of Theorem 11. Fix T e Tso- Since T is complete and finite, 
T(x) is always a leaf of T. Thus our goal can be restated as: 



(F.l) 



sup 

wes 



\d{w) 



lL,r 



\d{w) 



< 



C2 

logn' 



where S is the set of leaves of T. To prove this, we apply Theorem 17 to 
each of the processes X{£)t.'^, replacing n with n — 1 and choosing the other 






parameters as follows: 

5 = 5^ = — , S, = , S = {w £ A* : w is a leaf of T, min ■K(,{w) > 0}. 

L Inn 1<^<L 

Notice that the quantities tts and hs in Theorem 17 equal tTj, and hj, (resp.). 
Moreover, tti{w) > vr^ for all w G S so that 

\S\ vr^ < 5; P (nX(l):L) = u;) < 5^ P (nX(l):L) =«;)<!. 

toe5 iogA* 



Hence jS*! < tt- < n by our assumption on T. Using Remark 9 (after the 
Theorem 17), we see that its conclusion will hold whenever: 

^r^n^ (/i^ + l)log^n ^ r< n , ^log^ n ^^ [ Bufi 



TTj^ 



and n > Ci (1 + a) 



TTj^ 



r 



6 



for some universal constant Ci. The first condition holds since T € T^q. 
1 1 



For the second, we notice that Assumption 2 implies /? ^(x) < (T/x) 
l + (rVa;^). Thus 



/3 



6 



1 -1 1 -1 / IN -1 1 -1 

< l + (6r)7,^, ^log^nvr-^ < M +(6r)7J 5* Mog^nvr-^ 



]_ _i 37 + 1 2+1 

Hence we need: n > C (l + a) (1 + (GF) ^ ) 5* ^ log t n vr-^ , or equivalently 



:7+i 



(F.2) 



T — 



ni+i 



,[Ci(l + a)(l + (6F)')]7+i/ log^n 



for some constant Ci that is entirely determined by C. Since 5* = Sq/L = 
5o, (F.2) is also guaranteed by T G T^q (if Ci is chosen appropriately). 



We conclude that VI < ^ < L: 



y-w leaf of T 
Therefore, 



TT£{w) (n - 1) 



< 



logn 



> 1-5* 



Nn-i,eiw) 



TTi{w) {n - 1) 



1 



< 



1 



logn 



VI < ^ < L, Vw leaf of T : 
Given that this last event occurs, direct calculations yield 

VI <^<L,VwGS, 1 



1-* 

L 



> 1 - 5n. 



C2 dAw) C2 

< _ ^ ^ < 1 + 



logn cf£{'w) ' logn 

where C2 > is universal. From this one concludes (F.l) for any r > 1. D 
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Proof of Theorem 12. Theorem 5 shows that the occurrence of Good^ 
imphes that for ah x G ^Iqo 



d{p{-\x),Pn{-\x)) < Cy5o(^)+max<^ Cj^som,{1 + 2c) d{T <'{x)) 

L,fe ' \ C — i ^ ' 



C+1 



L,r 



However, Crp5Qi^\ < Cj^sqm (by definition of the two quantities) and by the 
typicahty event: 



cf(T^°(x)) 
Thus, for ah x G AZr^ 

d{pi-\x),Pn{-\x)) 



<(1 + ^ 
L,r \ log n 



cf(r^°(x)) 



L,r 



-oo 



. C + 1 
< Cj.^0 (x) + max _ 



-^)(l + 2c)||^(T^-«(^))|L. 
log riy " '"^'^ 



D 



Proof of Theorem 13. Let w eT^. We will show that the occurrence 

of Typ^(T) implies that CanRmv(t(;)= 0, so that w G Tn- By assumption, 

there exist x G ^Iqo ^^^ V ^ ^-oo with with T(x) >z w, T{y) ^ par(t(;) 
with: 

(F.3) 



\\d{p{-\x),p{-\y))\\^^, >(} + ^) (1 + c) 
Notice that if Goodm H Typ^(r) holds. 



d(T(x)) 



cf(r(y)) 



L,r 



||d{p{.|a:),p„(.|T(x))||^ ^ < ||d{p{.|a:),p„(.|T(a:)))||^ ^ + \\d{p^(-\T{x)) , p^{-\T{x)))\\^ ^ 
(dotn. of cj^^j) < S^(^) + ||d(p„(-|T(a:)),25„(-lT(a:)))||^^ 
(Good„ holds and Thm. 1) < c^(^) + l|rf{T(2:)) ||l r 

C2 



log n , 



c\\d{T{x))\\ 



Similarly, if Good^ n Typ^ (T) holds 

d{p{-\y),Pn{-\T{y)) 
Combining these two inequalities shows that: 



< c^ +11 + -^ 
L,fc -* ^^^^ V log n 



cHTiy)) 



L,k 



||d(j3„(.|T{a:)),j3„(.|T(H))||^ ^ > N(p(- |y), p(' l^c)) IIl,^ - ||d{p{. |a) , p„ (■ |T(j/)) ||^ ^ - ||d(p(. |a:), p„(.|T(x))||^ ^ 



> <i(p(-\y),p(.'\x)) - c. 



r(x) -T(y) 



(1 + !§;;) (Ihf(^(-))|L,, + lhf(^(^»IL, J ■ 
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We may now plug (F.3) to deduce that if Good^ n Typ^(T) holds, then: 



(i + T^)Klh'(^w'IL.. + lh'(^M'IL,J 



(Typ^(T) holds) > c \\a{T(x))\\ + cf(T(:c)) 



Therefore, the occurrence of Goodm n Typ^(T) implies that 3w' = T{x) >z 
w,w" = T{y) >z par(w) : 

\\d{Pn{-\w'),Pn{-\w")\\^^^, > C (^||cf(w')||L,fc + 11^^ (^") ||L,fc) ' 

which implies CanRmv(tt;) = 0. D 



Proof of Theorem 14. Fix some x e A_^. If Good^ holds, for all 
w,w' >: T_|_(x), 

d{pni-\w),Pn{-W)) < d{p^i-\w),p^i-\w')) + (lIcfHIlL,, + ||cfK)||L,r 



k ' 



since both p„(-|w) and Pni'l^') are convex combinations of probabilities of 
the form p{-\y) with y ^ T^{x). We have also assumed: 



Cf+(x) ^2^1 ^^^^ 



and if Typ^(T+) holds, we have 

C2 



(c-1) cf(T+(x)) 



L,r 



1 



logn 



cf(r+(x)) 



< 



L,T- 



cf(r+(x)) 



L,T- 



Hence under Goodm n Typ^(T+): for all Wjw' >: 7+(x), 

dipn{-\w),pn{-\w')) <2(C-1) Cf(r+(X)) + (||cf M ||l,, + ||cf K)) IIl,.) 

L,r \ ' / 

<c(||cfH|lL,, + ||cfK))|lL,.), 

since the confidence radii are monotone. This last inequality implies that: 

yw >~ 7+(x), wi ^ w, W2^ par(tf), 

diPni-\wi),Pni-\w2)) < C (||cf (wi) ||l^^ + ||cf (^2)) IIl,^) • 

In particular, all w >- T+(x) are such that CanRmv(i(;) = 1. But any w that 
does not belong to r+ satisfies w >- T^{x) for some x. It follows that all 
nodes v with CanRmv(t;) = belong to r+. In particular, T^ '^T^- D 
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Proof of Theorem 15. Take 7 = (a + l)logn, T^ = x il/iei^W- Sim- 
ple calculus shows that the function /3(-) appearing in the assumption sat- 
isfies: V6 G N, (3{b) < Tyb~^. Hence Assumption 2 is satisfied. 

We now claim that: 

Claim 3. // (6.3) holds, T* gTs- 

Proof of the Claim. We need to check two conditions: 

(r,,. -1 < ( I \ (^)^n^ _ 

(F.4) TTy. < [ — — — ,__^^^^ 15+1 



(F.5) /it* + 1 < 



,[Ci(l-Fa)(l + (6r)7)]7+i/ log^+^n 



C7i(l + a)log^n 



Equation (F.5) is clearly a consequence of (6.3), at least if C3 > Ci. To check 
(F.4), we note that, at the cost of replacing Ci by Ci V 1, we may assume 
that the denominator in the bracketed fraction in the RHS is at least 1. In 
this case we may replace 7/(7 + 1) by 1 while decreasing the RHS. Moreover, 

fQY)l = (6y)V("+i)iogn (^±11M!! < (« + l)log^ 

since 6x < n^'^^. We deduce: 

/ 1 ' ' 



> 



V[Ci(l + a)(l + (6r)7)]^y \Ci{l + a){l + e'^ (^) log 

37+1 o 

We also note that n > 9 =^ log t+i n < log n and (since L/5 = n" 

1 7 7 — a ^ g + l 

(5/L)t+i 77,7+1 =7iT+i >n 1 >n/e 



n] 



by the choice of 7 = (a + 1) log n. We conclude that (F.4) is also implied by 
(F.5) in the Theorem if C3/C1 is large enough, which we can ensure by the 
appropriate choice of C3 . D 

Combining the Claim with the definition of T^g ) by Theorem 1 1 we have 

P(Typ^(T*)) > 1 - 5o- Now assume that Typ^(T*) n Good™ occurs, which 

will be the case with probability 1 — 5 — 5q. In that case the bound on 

d(p{-\x),Pn{-\x)) follows from combining Theorem 12 and the fact that 

L,A: 

Ct* (x) = under the parametric case (Assumption 3) . We also have Tn ^ T* 
from Theorem 2. 
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To finish, we must also prove that T„ 5 T* via Theorem 13 applied to 
T = T^ = T* . In order to check the assumptions, we note that, under 
Assumption 3, if T*{x) is a leaf of T* , then CT*fx) — 0- By definition of 
^fc,min) we also have 

\\d{p{.\x),p{-\y))h,k = ||d(p(-|r*(x)),p(.|r*(y)))||L,fc > 4,min 

for any y G AZ^ with T*{y) ^ par(T*(x)). Hence the condition in Theorem 
13 is implied by: 



C2 

lognj '" ' '' x'^a: 



2{l + j--^]{l + c) max ||cf(T*(x))||L ^, < 4,: 



where the minimum can be taken over the (finite many) leaves of T* instead. 

D 

Proof of Theorem 16. Comets, Fernandez and Ferrari [10] have shown 
that processes with continuity rates A(6) = O (6~^~'^) and (^min > are ip- 
mixing with rate function 0{h~^). Since 93-mixing implies /3-mixing with 
the same rate function, we deduce from Assumption 4 that the processes 
X(l), . . . ,X{L) are /3-mixing with common rate function j3{h) = Th~^ [h G 
N) where P > depends on qmin and Pq- This brings us into the realm 
of Assumption 2. We also note for later use that our assumption on gmin 
implies: 



(F.6) Vu;G^*, VI <^<L : ^^(u;) > 



\w\ 



Now let h = [logn/[31og (l/gmin)]l and define T^ C A* as the tree con- 
taining all strings of length < h. Our next goal is to prove the following 
Claim. 

7-1 
Claim 4. Let 5q = hn 2 . Under the assumptions of the Theorem, 

Th gTso- 

Proof of the Claim. We need to show that: 



(F.7) TT^l < ^ 



5a)-+i n^ 



,[Ci(l + Q)(l + (6P)')]^+iy log^n 
(F.8) hT, + l < ''^'^'' 



i-h 



Ci{l + a) log^n 
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To prove (F.7), we note that 87 + 1 < 8(7 + 1) and, by definition of 6q, 

1 7 
(5o/L)t+^ n^+i = y/n. Hence we may underestimate the RHS of (F.7) by: 



n 



Ceil + a) log^n 

where Cq depends on T and 7 only. On the other hand, (F.6) apphed to the 
leaves of T^ gives: 

(F.9) TT^l < f ^V < (^) n^l^. 

The inequality (F.7) thus follows from: 

1 \^V3< ^^^ 



Ce (1 + a) log'^ n 

which is implied by (6.4) if C4 is large enough (this will only depend on F, 
7 and gmin)- 

We use the same upper bound on vr^ to check (F.8). Thus, the RHS can 
be lower bounded by: 

which is larger than h if (6.4) holds (if C4 is large enough). D 

We may now use T^ to upper bound the estimation error appearing in 
Theorem 12 since T ° yields a bound at least as good as T^. We first need 
bounds on Cf^^uy We have noted before the statement of Theorem 16 that 
the total variation distance between p£(-|r6(a;)) and 'pi(a\x) is at most \i{J3). 
By the triangle inequality, 

VI < £ < L, CT^^^,-) < ||{2AK/i)}Li||Lfc < 2ro/i-^-^ < T-%r- 

111^, K log' n 

with C5 as in the Theorem. On the other hand, since for all leaves w of T/^ 
we have 

,, ^ ^ n2/3 (C4)2/3(l + a)Mog2(7+2)^ 

y ymiii J \ yinin 

One may use this to check that 

sup ||cf(T;,(a;))|| =0, ^ 

• \log' n 
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where the asymptotics are for qmm-,1 and Fq fixed and n large. Increasing 
C5 if necessary, we conclude that for all x G ^Iqo- 

ct,(,) + max I ^ct, (.),(! + 2c) p(r;,(x)) 1 1 L_ J < ^^^^+1^ , 

and this implies the desired result via Theorem 12 and some further adjust- 
ments of C5. D 

APPENDIX G: TYPICALITY RESULTS FOR /3-MIXING PROCESSES 
In what follows we use 

r^{x) = min{6 G N : V6' > 6, /3(6') < x] {x e (0, 1)). 

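Theorem 17. Let $X_{-\infty}^{\infty}$ be a stationary and $\beta$-mixing process over alphabet 
$A$ with mixing rate function $\beta(\cdot)$. Consider a non-empty finite set 
$S \subset A^*$ and define: 

\[ h_S = \max_{w \in S} |w|, \qquad \pi_S = \min_{w \in S} \pi(w), \qquad \text{where } \pi(w) = P\left[X_1^{|w|} = w\right]. \]

Let $\xi > 0$, $\delta \in (0, 1/e)$ and $n \in \mathbb{N}$ satisfy: 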
'Whs 



n > 2 



--'(^)}^{-S-'"'" 



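then the random variables 

\[ N_n(w) = \left|\left\{ |w| \leq j \leq n : X_{j-|w|+1}^{j} = w \right\}\right|, \qquad w \in S, \]

satisfy: 

\[ P\left( \forall w \in S : \left| \frac{N_n(w)}{\pi(w)\, n} - 1 \right| \leq \xi \right) \geq 1 - \delta. \]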
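To illustrate what the typicality statement measures, here is a minimal simulation sketch. It assumes a symmetric two-state Markov chain (a simple geometrically mixing example chosen for illustration, not one used in the paper) and reports the largest relative deviation of the counts $N_n(w)$ from $\pi(w)\,n$ over a set of words. All function names and parameters are hypothetical.

```python
import random
from itertools import product


def count_occurrences(x, w):
    """N_n(w): number of end positions j in [|w|, n] at which w occurs in x."""
    k = len(w)
    w = tuple(w)
    return sum(1 for j in range(k, len(x) + 1) if tuple(x[j - k:j]) == w)


def simulate_chain(n, p_stay, rng):
    """Symmetric chain on {0, 1}; the uniform start makes it stationary."""
    x = [rng.randrange(2)]
    for _ in range(n - 1):
        x.append(x[-1] if rng.random() < p_stay else 1 - x[-1])
    return x


def stationary_prob(w, p_stay):
    """pi(w) for the symmetric chain: 1/2 times the transition probabilities."""
    prob = 0.5
    for a, c in zip(w, w[1:]):
        prob *= p_stay if a == c else 1.0 - p_stay
    return prob


def max_relative_deviation(x, words, p_stay):
    """max over w of |N_n(w) / (pi(w) n) - 1|."""
    n = len(x)
    return max(
        abs(count_occurrences(x, w) / (stationary_prob(w, p_stay) * n) - 1.0)
        for w in words
    )


rng = random.Random(0)
sample = simulate_chain(20000, 0.7, rng)
words = [w for k in (1, 2) for w in product((0, 1), repeat=k)]
print(max_relative_deviation(sample, words, 0.7))
```

The printed quantity is the smallest $\xi$ for which the event in the Theorem holds on this particular sample and this choice of $S$.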

Remark 9. In our application we will take $n \geq 3$, $\xi = 1/\log n$, $\delta \geq n^{-\alpha}$ 
and $|S| \leq n$. In this case the condition on $n$ in the Theorem is satisfied 
whenever: 

n > c (1 + o) f i^^^±^li^ V f i^ r ^ ^^ "' ' 



T^S J \ TTs \ 24 

where $C > 0$ is universal. 
Proof of Theorem 17. Consider a number $b \in \mathbb{N} \setminus \{0\}$. Given $r \in \mathbb{N}$, 
a sequence $B = (B_1, \dots, B_r)$ of subsets of $[n]$ is said to consist of 
$b$-separated blocks if each $B_i$ is an interval in $[n]$ and $\min B_{i+1} > \max B_i + b$ 
for $1 \leq i \leq r - 1$. We say that such a sequence is $t$-regular if 
$t = |B_1| = |B_2| = \dots = |B_{r-1}| \geq |B_r|$. 

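Lemma 6. Under the assumptions of the Theorem, let $(B_1, \dots, B_r)$ be 
a sequence of $b$-separated $t$-regular blocks where $t \geq 2|w|$ for every $w \in S$. Define for each 
$w \in S$ the number of occurrences of $w$ that are contained in one of the 
blocks $B_i$: 

\[ N(w) = \left|\left\{ j \in [n] : \exists i \in [r],\ j,\ j + |w| - 1 \in B_i \text{ and } X_j^{j+|w|-1} = w \right\}\right| \]

and let $n_w$ denote the number of places where $w$ may occur: 

\[ n_w = \sum_{i=1}^{r} (|B_i| - |w| + 1)_+ = (t - |w| + 1)(r - 1) + (|B_r| - |w| + 1)_+. \]

Given $\lambda > 0$, let $E(\lambda)$ denote the event: 

\[ E(\lambda) = \left\{ \forall w \in S,\ \left| \frac{N(w)}{n_w} - \pi(w) \right| \leq \lambda\, \pi(w) \right\}. \]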
Then 



P(^(A)) > l-2|5|exp 



Proof of Lemma 6. Let $\tilde X_{B_1}, \dots, \tilde X_{B_r}$ be a sequence of independent 
random variables where each $\tilde X_{B_i}$ has the same distribution as $X_{B_i}$. Define 
$\tilde N(w)$ and $\tilde E(\cdot)$ in analogy with $N(w)$ and $E(\cdot)$ (respectively). 

Our first major goal in the proof is to show: 

Claim 5. $P(E(\lambda)) \geq P(\tilde E(\lambda/2)) - \dfrac{2\beta(b)}{\lambda \pi_S}$. 

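To prove the claim we first construct a coupling of $\tilde X_{B_1}, \dots, \tilde X_{B_r}$ to the 
process $X_1^n$. Set $\tilde X_{B_1} = X_{B_1}$. Assuming that we have defined $\tilde X_{B_i}$ for 
$1 \leq i < j$, we sample $(X_{B_j}, \tilde X_{B_j})$ from a coupling achieving total variation 
distance. That is to say, 

\[ P\left( X_{B_j} \neq \tilde X_{B_j} \mid X_{B_i}, i < j \right) = \sup_E \left| P\left( X_{B_j} \in E \mid X_{B_i}, i < j \right) - P\left( X_{B_j} \in E \right) \right|. \]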
The $b$-separation condition implies that $X_{B_j}$ is $b$ steps ahead into the future 
from $X_{B_i}$, $i < j$. Therefore, the $\beta$-mixing condition implies: 

\[ P\left( X_{B_j} \neq \tilde X_{B_j} \right) = E\left[ \sup_E \left| P\left( X_{B_j} \in E \mid X_{B_i}, i < j \right) - P\left( X_{B_j} \in E \right) \right| \right] \leq \beta(b). \]
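The coupling step relies on the standard fact that two distributions can be sampled jointly so that the samples differ with probability exactly equal to their total variation distance. The sketch below illustrates this for two discrete distributions given as dictionaries; since each block $X_{B_j}$ takes values in a finite set of words, the discrete case is the relevant one. The function names and the example distributions are hypothetical and serve only as an illustration of the technique.

```python
import random


def total_variation(p, q):
    """Total variation distance between two discrete distributions (dicts)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def _draw(weights, total, rng):
    """Draw a key with probability proportional to its (non-negative) weight."""
    u = rng.random() * total
    acc = 0.0
    last = None
    for key, weight in weights.items():
        if weight <= 0.0:
            continue
        acc += weight
        last = key
        if u < acc:
            return key
    return last  # guards against floating-point round-off


def sample_maximal_coupling(p, q, rng):
    """Draw (X, Y) with X ~ p, Y ~ q and P(X != Y) = TV(p, q)."""
    keys = sorted(set(p) | set(q))
    overlap = {k: min(p.get(k, 0.0), q.get(k, 0.0)) for k in keys}
    mass = sum(overlap.values())  # equals 1 - TV(p, q)
    if rng.random() < mass:
        x = _draw(overlap, mass, rng)
        return x, x
    # The residual parts have disjoint supports, so X != Y on this branch.
    res_p = {k: p.get(k, 0.0) - overlap[k] for k in keys}
    res_q = {k: q.get(k, 0.0) - overlap[k] for k in keys}
    return _draw(res_p, 1.0 - mass, rng), _draw(res_q, 1.0 - mass, rng)


p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.25, "c": 0.5}
rng = random.Random(0)
pairs = [sample_maximal_coupling(p, q, rng) for _ in range(20000)]
mismatch_rate = sum(x != y for x, y in pairs) / len(pairs)
print(total_variation(p, q), mismatch_rate)
```

In the proof, $p$ plays the role of the conditional law of $X_{B_j}$ given the earlier blocks and $q$ the role of its unconditional law.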



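Now observe that in order for $E(\lambda)$ to hold it suffices that $\tilde E(\lambda/2)$ holds and 
that 

\[ \forall w \in S : \frac{|N(w) - \tilde N(w)|}{n_w} \leq \frac{\lambda \pi_S}{2}. \]

Therefore we will be done once we show that: 

\[ P\left( \max_{w \in S} \frac{|N(w) - \tilde N(w)|}{n_w} > \frac{\lambda \pi_S}{2} \right) \leq \frac{2 \beta(b)}{\lambda \pi_S}. \tag{G.1} \]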



To do this, notice that for any $w$: 

\[ |N(w) - \tilde N(w)| \leq \sum_{i=1}^{r} (|B_i| - |w| + 1)_+ \, \mathbf{1}\{X_{B_i} \neq \tilde X_{B_i}\}. \]

This is because each block $B_i$ may contain at most $(|B_i| - |w| + 1)_+$ occurrences 
of $w$. 

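The first $r - 1$ blocks have the same size $|B_i| = t$, whereas the last one 
cannot be larger, hence $n_w \geq (r - 1)(t - |w| + 1)_+$. Moreover, $\tilde X_{B_1} = X_{B_1}$ 
always. We deduce: 

\[ |N(w) - \tilde N(w)| \leq (t - |w| + 1)_+ \sum_{j=2}^{r} \mathbf{1}\{X_{B_j} \neq \tilde X_{B_j}\} \leq \frac{n_w}{r - 1} \sum_{j=2}^{r} \mathbf{1}\{X_{B_j} \neq \tilde X_{B_j}\} \]

and 

\[ E\left[ \max_{w \in S} \frac{|N(w) - \tilde N(w)|}{n_w} \right] \leq \frac{\sum_{j=2}^{r} P\left[ X_{B_j} \neq \tilde X_{B_j} \right]}{r - 1} \leq \beta(b). \]

We deduce from Markov's inequality that: 

\[ P\left( \max_{w \in S} \frac{|N(w) - \tilde N(w)|}{n_w} > \frac{\lambda \pi_S}{2} \right) \leq \frac{2 \beta(b)}{\lambda \pi_S} \]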
and this is precisely (G.1), which finishes the proof of the Claim. 
We must now bound $P(\tilde E(\lambda/2))$. By the union bound, we have: 

\[ P\left( \tilde E(\lambda/2) \right) \geq 1 - \sum_{w \in S} P\left( \left| \frac{\tilde N(w)}{n_w} - \pi(w) \right| > \frac{\lambda \pi(w)}{2} \right). \tag{G.2} \]



Fix a $w \in S$. Let $\tilde N_i(w)$ denote the number of occurrences of $w$ in $\tilde X_{B_i}$. We 
will apply Bennett's inequality to the sum of these random variables. To 
this end we note that: 

2. The $\tilde N_i(w)$ are independent. This is so because the $\tilde X_{B_i}$ are independent. 

3. $\tilde N_i(w) \leq (|B_i| - |w| + 1)_+ \leq b$ for all $i$ because $b$ is an upper bound on 
$|B_i|$. 

4. $\sum_i E\left[ \tilde N_i(w) \right] = \pi(w)\, n_w$. 

5. $\sum_i V\left( \tilde N_i(w) \right) \leq \pi(w)\, n_w\, b$. This is so because each $\tilde N_i(w)$ is a sum of 
$(|B_i(w)| - |w| + 1)_+ \leq b$ indicators with variance $\pi(w)(1 - \pi(w))$ and 
the variance of a sum of $\leq b$ terms is at most $b$ times the sum of the 
variances (by Cauchy-Schwarz). 

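Therefore, 

\[ P\left( \left| \tilde N(w) - \pi(w)\, n_w \right| > \frac{\lambda\, \pi(w)\, n_w}{2} \right) \leq 2 \exp\left( - \frac{\lambda^2\, \pi(w)\, n_w}{8 b\, (1 + \lambda/6)} \right). \]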



Since $t \geq 2h_S$, 

\[ \forall w \in S, \quad n_w = \sum_i (|B_i(w)| - |w| + 1)_+ \geq (r - 1)(t - |w| + 1) \geq \frac{(r - 1)\, t}{2} \]

and the result follows from plugging the probability inequality into (G.2) 
and applying the Claim. □

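From now on we set: 

\[ b = \left\lceil \frac{6 h_S}{\xi} \vee \beta^{-1}\!\left( \frac{\xi\, \pi_S\, \delta}{24} \right) \right\rceil, \tag{G.3} \]

\[ r = \left\lceil \frac{n}{2b} \right\rceil. \tag{G.4} \]

We now construct three sets of $b$-separated blocks in $[n]$. The first one is: 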
1. $B^{(1)} = (B_1^{(1)}, \dots, B_r^{(1)})$ consists of intervals of the form 

\[ B_i^{(1)} = \{ b(2i - 2) + s : 1 \leq s \leq b \} \cap [n]. \]

These are the intervals of length $b$ whose right endpoints are odd 
multiples of $b$. These are $b$-separated $b$-regular blocks. 

2. $B^{(2)} = (B_1^{(2)}, \dots, B_r^{(2)})$ consists of intervals of the form 

\[ B_i^{(2)} = \{ b(2i - 1) + s : 1 \leq s \leq b \} \cap [n]. \]

These are the intervals whose right endpoints are even multiples of $b$. 
In this case we set $B_i^{(2)}(w) = B_i^{(2)}$ for each $1 \leq i \leq r$ and $w \in S$. This 
also results in $b$-separated $b$-regular blocks. 

3. $B^{(3)} = (B_1^{(3)}, \dots, B_{2r-1}^{(3)})$ consists of intervals 

\[ B_i^{(3)} = \{ bi - h_S + 2,\ bi - h_S + 3,\ \dots,\ bi + h_S \} \cap [n]. \]

This results in $b$-separated $2h_S$-regular blocks, as one can check. (Here 
one must use $b > 2h_S$, which follows from (G.3) and the assumption 
$\xi < 1/2$.) 

For each $k \in \{1, 2, 3\}$, let $N^{(k)}(w)$ count the number of occurrences of $w$ that 
are contained in a block of the form $B_i^{(k)}$ and let $n_w^{(k)} = \sum_i (|B_i^{(k)}| - |w| + 1)_+$. 
We will need two propositions. 

Proposition 5. For any $w \in S$, 

\[ N^{(1)}(w) + N^{(2)}(w) \leq N_n(w) \leq N^{(1)}(w) + N^{(2)}(w) + N^{(3)}(w). \]

Proof. The LHS counts the number of occurrences of $w$ contained in 
intervals of the form $\{bi + 1, bi + 2, \dots, b(i + 1)\}$. Since $|w| \leq h_S$, $N^{(3)}(w)$ 
is an upper bound on the number of occurrences of $w$ that are not entirely 
contained in one of those intervals. □
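The three block families and the sandwich inequality of Proposition 5 can be checked numerically. The sketch below builds the families from the formulas above, with positions indexed from 1 and $r = \lceil n/2b \rceil$ as an assumption about (G.4), and counts occurrences that lie entirely inside a block. The function names, the binary alphabet and the values of $n$, $b$, $h_S$ are hypothetical.

```python
import random
from itertools import product


def block_families(n, b, h):
    """Three block families as lists of 1-indexed ranges clipped to [1, n]."""
    r = -(-n // (2 * b))  # ceil(n / (2b)) in exact integer arithmetic

    def clip(lo, hi):
        # Yields an empty range when the interval falls outside [1, n].
        return range(max(lo, 1), min(hi, n) + 1)

    fam1 = [clip(b * (2 * i - 2) + 1, b * (2 * i - 1)) for i in range(1, r + 1)]
    fam2 = [clip(b * (2 * i - 1) + 1, b * (2 * i)) for i in range(1, r + 1)]
    fam3 = [clip(b * i - h + 2, b * i + h) for i in range(1, 2 * r)]
    return fam1, fam2, fam3


def count_in_blocks(x, w, blocks):
    """Occurrences of w in x that lie entirely inside one of the blocks."""
    k = len(w)
    w = list(w)
    total = 0
    for blk in blocks:
        # j is the 1-indexed start; the occurrence covers j .. j + k - 1.
        for j in range(blk.start, blk.stop - k + 1):
            if x[j - 1:j - 1 + k] == w:
                total += 1
    return total


n, b, h = 500, 20, 3
rng = random.Random(1)
x = [rng.randrange(2) for _ in range(n)]
fam1, fam2, fam3 = block_families(n, b, h)
whole = [range(1, n + 1)]

for k in range(1, h + 1):
    for w in product((0, 1), repeat=k):
        n1 = count_in_blocks(x, w, fam1)
        n2 = count_in_blocks(x, w, fam2)
        n3 = count_in_blocks(x, w, fam3)
        total = count_in_blocks(x, w, whole)
        print(w, n1 + n2, total, n1 + n2 + n3)
```

Each printed row shows the lower bound, the full count $N_n(w)$ and the upper bound; the gap between the two bounds comes only from occurrences that straddle a boundary between consecutive length-$b$ intervals.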

Proposition 6. For any w £ S, nw\nw > iLzMllIL^ ^(J) _|_ „(^) < ^ 
andn^^^ < ^. 

Proof. Since $B^{(1)}$ is $b$-regular and $r \geq n/2b$ (by G.4), 

l^^ , s ,, , ,x n n\w\ , n ( he b\ 

„L">(r-l)((,-H)>2-^-'.= 2(l-f--j. 
By (G.3) $b \geq 6h_S/\xi$ and $n \geq 6b/\xi$, so $n_w^{(1)} \geq (1 - \xi/3)\, n/2$ as desired. The 
same argument works for $n_w^{(2)}$. For $n_w^{(3)}$, we start from $2r - 1 \leq 2(n/2b + 1) - 1 = n/b + 1$. 
Since each block contains at most $2h_S$ points, 

\[ n_w^{(3)} \leq \frac{2 h_S\, n}{b} + 2 h_S. \]



The rest follows from b/n < .^ < 1. 
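Consider the events: 

\[ G^{(1)} = \left\{ \forall w \in S,\ \left| \frac{N^{(1)}(w)}{n_w^{(1)}} - \pi(w) \right| \leq \frac{\xi\, \pi(w)}{3} \right\}, \tag{G.5} \]

\[ G^{(2)} = \left\{ \forall w \in S,\ \left| \frac{N^{(2)}(w)}{n_w^{(2)}} - \pi(w) \right| \leq \frac{\xi\, \pi(w)}{3} \right\}, \tag{G.6} \]

\[ G^{(3)} = \left\{ \forall w \in S,\ \frac{N^{(3)}(w)}{n} \leq \frac{\xi\, \pi(w)}{3} \right\}, \tag{G.7} \]

\[ G = G^{(1)} \cap G^{(2)} \cap G^{(3)}. \tag{G.8} \]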



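Claim 6. $G \subset \{ \forall w \in S : |N_n(w) - \pi(w)\, n| \leq \xi\, \pi(w)\, n \}$. 
Proof. Assume $G$ holds. Then for any $w \in S$: 

\[ \begin{aligned} N_n(w) &\geq N^{(1)}(w) + N^{(2)}(w) \\ \text{($G$ occurs)} &\geq \pi(w) \left( 1 - \tfrac{\xi}{3} \right) \left( n_w^{(1)} + n_w^{(2)} \right) \\ \text{(Proposition 6)} &\geq \pi(w) \left( 1 - \tfrac{\xi}{3} \right)^2 n \\ &\geq (1 - \xi)\, \pi(w)\, n. \end{aligned} \]

On the other hand, 

\[ \begin{aligned} N_n(w) &\leq N^{(1)}(w) + N^{(2)}(w) + N^{(3)}(w) \\ \text{($G$ occurs)} &\leq \pi(w) \left( 1 + \tfrac{\xi}{3} \right) \left( n_w^{(1)} + n_w^{(2)} \right) + \tfrac{\xi}{3}\, \pi(w)\, n \\ \text{(Proposition 6)} &\leq \pi(w) \left( 1 + \tfrac{2\xi}{3} \right) n \\ &\leq (1 + \xi)\, \pi(w)\, n. \end{aligned} \]

□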
The claim implies that: 

\[ P(G) \leq P\left( \forall w \in S : |N_n(w) - \pi(w)\, n| \leq \xi\, \pi(w)\, n \right) \]

and we proceed to bound $P(G)$ from below. 

We first apply Lemma 6 to each $G^{(k)}$. For $k = 1, 2$ we may take $t = b$, 
$\lambda = \xi/3$ and note that $r \geq n/2b$, $\beta(b) \leq \lambda \delta \pi_S / 8$ (cf. G.4, G.3) to deduce: 

(G.9) P ((G(i))^) + P ((G(2))^) < A\S\ exp | - ^'^^ ^" ~^^^ [ + ^ 

For $G^{(3)}$ we take $t = 2h_S$ and 

A = ^ 1 > 4 1 > ^ 



3 max^g5 rim ^hs 9hs 

since $b \geq 6h_S/\xi$, by (G.3). Notice that in this case $\lambda > 1$, hence 



.2/.-, , ^..^^o^/.. ?& 



A7(l + A/6) > 3A/4 > 



12/15 
Moreover, $\beta(b)/(\pi_S \lambda) \leq \delta/6$. Hence, we deduce: 

^ iiG^'^r) < 2|5| exp (-^^^M^ j^) + I 
<2|S|exp(i^^^ 

Now compare the exponential terms in the two equations. Since $\xi < 1/2$, 



150 144 (1 + ^1 288 192 
> ^^ ^ > > 

e - e - e - r 

hence the exponential term in (G.10) is larger than the exponential term in 
(G.9). We conclude that: 

^ 5 ,„, / r-l\ 



(G)<l-j;P (GW)^ >l---6|S|exp 



300 



fc=i \ i^T^s / 

To finish the proof, we recall that $r \geq n/2b$ (cf. G.4) and notice that our 
assumptions imply: 

26- e^s \ S 

Plugging this back into the previous inequality gives $P(G) \geq 1 - \delta$ and 
finishes the proof. □